Glossary

Mixture of Depths (MoD)

Explore how Mixture of Depths (MoD) optimizes AI efficiency by dynamically routing tokens. Learn how this technique reduces FLOPs in Ultralytics YOLO26 and LLMs.

In deep learning architectures, computational efficiency is paramount, especially when processing long sequences or high-resolution inputs. Mixture of Depths (MoD) addresses this by dynamically allocating compute: the network itself decides which parts of the input require full processing and which can safely bypass certain layers. This dynamic routing strategy reduces overall computational complexity without sacrificing the model's predictive power or accuracy.

Understanding the Concept

Mixture of Depths (MoD) is an architectural technique primarily applied to Transformer architectures where the model learns to dynamically skip computation for specific tokens at various layers. Traditional transformers process every token through every layer, whether it is a crucial piece of information or filler content. In contrast, an MoD model uses a router mechanism to evaluate tokens and assigns them a score. Only the top-scoring tokens—up to a predefined capacity limit—are passed through the heavy computation blocks, such as attention mechanisms or dense feed-forward layers. The remaining tokens bypass the block via residual connections, effectively creating a "mixture of depths" where different tokens experience varying levels of processing depth.

This method, popularized by recent DeepMind research and documented extensively in the arXiv repository, drastically reduces the total number of floating-point operations (FLOPs) required during both training and inference.
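The savings are easy to estimate with back-of-the-envelope arithmetic. The sketch below assumes a standard feed-forward block with a 4x expansion and uses illustrative numbers (not measurements from any particular model): halving the token capacity halves that block's FLOPs.

```python
# Illustrative FLOPs saving from depth routing (hypothetical numbers).
seq_len = 4096  # tokens in the sequence
d_model = 1024  # hidden size
capacity_factor = 0.5  # fraction of tokens routed through the MoD block

# A feed-forward block with 4x expansion costs roughly
# 2 * (d_model * 4*d_model) * 2 = 16 * d_model^2 multiply-adds per token.
ffn_flops_per_token = 16 * d_model**2

dense_flops = seq_len * ffn_flops_per_token
mod_flops = int(seq_len * capacity_factor) * ffn_flops_per_token

print(f"Dense block: {dense_flops / 1e9:.1f} GFLOPs")
print(f"MoD block:   {mod_flops / 1e9:.1f} GFLOPs")
print(f"Savings:     {1 - mod_flops / dense_flops:.0%}")
```

With a capacity factor of 0.5, the routed block performs half the work of its dense counterpart; the router itself adds only a negligible linear projection per token.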

Differentiating From Mixture of Experts (MoE)

It is easy to confuse this concept with a Mixture of Experts (MoE). While both use routing mechanisms, they solve different problems:

  • MoE routes tokens to different sub-networks (experts) within a layer. The computational depth remains the same for all tokens, but the model's parameter count increases.
  • MoD routes tokens to either the computation block or a skip connection. The parameter count remains strictly constant, but the computational depth decreases for less important tokens, directly improving inference latency.
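The distinction shows up directly in parameter counts. In this hedged sketch (illustrative dimensions, not any production architecture), a four-expert MoE layer holds four copies of the feed-forward weights, while an MoD-style layer keeps a single copy that some tokens simply skip:

```python
import torch.nn as nn

d = 128  # hidden size (illustrative)

# MoE: four expert FFNs inside one layer. Every token still passes through
# exactly one expert, so depth is constant but parameters grow 4x.
moe_experts = nn.ModuleList(
    [nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d)) for _ in range(4)]
)

# MoD: a single FFN that low-priority tokens bypass. Parameters stay
# constant, but the effective depth (and compute) varies per token.
mod_block = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))

moe_params = sum(p.numel() for p in moe_experts.parameters())
mod_params = sum(p.numel() for p in mod_block.parameters())
print(f"MoE holds {moe_params // mod_params}x the parameters of the MoD block")
```

The two techniques are complementary, and some research combines them: an MoD router can decide *whether* a token is processed, while an MoE router decides *which* expert processes it.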

Real-World Applications

The ability to dynamically budget compute makes this technique highly valuable across multiple domains of computer vision and natural language processing.

  1. Context Optimization in Language Models: Modern Large Language Models (LLMs) from organizations like OpenAI and Anthropic process massive context windows. By employing dynamic depth routing, these models can skip structural or repetitive filler words, reserving deep computation for complex reasoning steps and factual extraction.
  2. High-Resolution Vision AI: In advanced vision systems like the Ultralytics YOLO26 model, processing large images for object detection and image segmentation demands immense compute and memory. Depth routing allows the network to bypass feature extraction on uniform backgrounds (like empty skies or blank walls), focusing computational power on intricate foreground objects. This is crucial when deploying models to resource-constrained edge AI hardware, often accelerated with CUDA optimization libraries.
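As a toy illustration of the vision case (not actual YOLO26 internals), the sketch below scores synthetic image patches by activation variance — a crude hand-written stand-in for the learned router — so that near-uniform "background" patches never enter the heavy computation path:

```python
import torch

torch.manual_seed(0)

# Hypothetical stand-in for learned depth routing in vision: score each
# image patch and let flat, low-variance regions (e.g. empty sky) skip
# heavy feature extraction.
patches = torch.randn(1, 196, 256)  # a 14x14 grid of 256-dim patch embeddings
patches[:, :98] *= 0.01  # make the first half of the patches nearly uniform

variance = patches.var(dim=-1)  # cheap per-patch "interestingness" score
capacity = patches.size(1) // 2  # route only 50% of the patches
selected = torch.topk(variance, capacity, dim=1).indices

# Only the textured second half of the grid is selected for heavy compute
print(f"Lowest selected patch index: {int(selected.min())}")
```

In a trained MoD model the score would come from a learned linear router rather than a fixed variance heuristic, but the effect is the same: compute follows the content.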

Implementation Example

Below is a conceptual PyTorch snippet demonstrating how a basic routing mechanism might skip computation for a portion of input tokens, simulating a depth-routing behavior.

import torch
import torch.nn as nn


class MixtureOfDepthsBlock(nn.Module):
    def __init__(self, d_model, capacity_factor=0.5):
        super().__init__()
        self.capacity_factor = capacity_factor
        self.router = nn.Linear(d_model, 1)
        self.heavy_compute = nn.Sequential(
            nn.Linear(d_model, d_model * 4),
            nn.GELU(),
            nn.Linear(d_model * 4, d_model),
        )

    def forward(self, x):
        # x shape: (batch_size, seq_len, d_model)
        seq_len = x.size(1)
        capacity = int(seq_len * self.capacity_factor)

        # 1. Compute routing scores
        scores = self.router(x).squeeze(-1)  # Shape: (batch_size, seq_len)

        # 2. Identify top-k tokens to process
        topk_indices = torch.topk(scores, capacity, dim=1).indices

        # 3. Create an output tensor mirroring the input (residual baseline)
        output = x.clone()

        # 4. Apply heavy computation only to the selected tokens, using a
        #    vectorized gather/scatter instead of a Python loop. (Real MoD also
        #    scales the result by the router weight so routing stays differentiable.)
        batch_idx = torch.arange(x.size(0)).unsqueeze(-1)  # Shape: (batch_size, 1)
        selected_tokens = x[batch_idx, topk_indices]  # Shape: (batch_size, capacity, d_model)
        output[batch_idx, topk_indices] += self.heavy_compute(selected_tokens)

        return output


# Example usage
dummy_input = torch.randn(2, 64, 128)  # Batch=2, Seq=64, Dim=128
mod_block = MixtureOfDepthsBlock(d_model=128, capacity_factor=0.5)
output = mod_block(dummy_input)
print(f"Output shape: {output.shape}")  # Expect (2, 64, 128)

By leveraging frameworks such as PyTorch or TensorFlow, developers can integrate these custom model optimization blocks into their architectures. Furthermore, tools like the Ultralytics Platform help teams manage the training data needed to train these routers accurately, while integrating with enterprise ecosystems such as Google Cloud AI.
