Explore how Mixture of Depths (MoD) optimizes AI efficiency by dynamically routing tokens. Learn how this technique reduces FLOPs in Ultralytics YOLO26 and LLMs.
In deep learning architectures, computational efficiency is paramount, especially when processing long sequences or high-resolution inputs. A novel approach dynamically allocates compute resources by allowing the network to decide which parts of the input require full processing and which can safely bypass certain layers. This dynamic routing strategy reduces overall computational complexity without sacrificing the model's predictive power or accuracy.
Mixture of Depths (MoD) is an architectural technique primarily applied to Transformer architectures where the model learns to dynamically skip computation for specific tokens at various layers. Traditional transformers process every token through every layer, whether it is a crucial piece of information or filler content. In contrast, an MoD model uses a router mechanism to evaluate tokens and assigns them a score. Only the top-scoring tokens—up to a predefined capacity limit—are passed through the heavy computation blocks, such as attention mechanisms or dense feed-forward layers. The remaining tokens bypass the block via residual connections, effectively creating a "mixture of depths" where different tokens experience varying levels of processing depth.
This method, popularized by recent DeepMind research and documented extensively in the arXiv repository, drastically reduces the total number of floating-point operations (FLOPs) required during both training and inference.
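To make the savings concrete, here is a rough back-of-the-envelope FLOPs estimate for a single feed-forward block. The sequence length, model width, and expansion factor below are arbitrary illustrative numbers, not values from the original paper, and the count is a simplified multiply-add approximation.

```python
# Rough FLOPs estimate for one feed-forward block, illustrating the savings.
# Hypothetical dimensions: 4096 tokens, model width 1024, 4x FFN expansion,
# and a MoD capacity factor of 0.5 (half the tokens routed into the block).
seq_len, d_model, expansion = 4096, 1024, 4
capacity_factor = 0.5

# A dense FFN costs roughly 2 * d_model * (expansion * d_model) multiply-adds
# per token for the up-projection, and the same again for the down-projection.
flops_per_token = 2 * (d_model * expansion * d_model) * 2

full_flops = seq_len * flops_per_token
mod_flops = int(seq_len * capacity_factor) * flops_per_token

print(f"Dense block: {full_flops / 1e9:.1f} GFLOPs")
print(f"MoD block:   {mod_flops / 1e9:.1f} GFLOPs")
```

With a capacity factor of 0.5, the routed block costs exactly half the FLOPs of its dense counterpart; the router itself adds only a negligible `d_model`-sized projection per token.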
It is easy to confuse this concept with a Mixture of Experts (MoE). While both use learned routing mechanisms, they solve different problems: in an MoE layer, the router sends each token to one (or a few) of several parallel expert networks, so every token still receives a full forward pass and routing primarily expands model capacity, or "width." In MoD, the router decides whether a token is processed at all at a given layer, so routing controls effective "depth" and directly reduces the compute spent per token.
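The contrast can be seen in a minimal, illustrative sketch of MoE-style routing (not any particular library's implementation; the dimensions and expert count are arbitrary assumptions). Note that every token is still processed by some expert here, whereas an MoD router would let low-scoring tokens skip computation entirely.

```python
import torch
import torch.nn as nn

# Toy MoE-style routing for contrast: each token is assigned to the expert
# with the highest routing score, but no token skips computation.
d_model, num_experts = 128, 4
router = nn.Linear(d_model, num_experts)
experts = nn.ModuleList(nn.Linear(d_model, d_model) for _ in range(num_experts))

x = torch.randn(16, d_model)           # 16 tokens
expert_ids = router(x).argmax(dim=-1)  # one expert index per token
out = torch.stack(
    [experts[int(i)](tok) for tok, i in zip(x, expert_ids)]
)
print(out.shape)  # all 16 tokens were processed by some expert
```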
The ability to dynamically budget compute makes this technique highly valuable across multiple domains of computer vision and natural language processing.
Below is a conceptual PyTorch snippet demonstrating how a basic routing mechanism might skip computation for a portion of input tokens, simulating a depth-routing behavior.
```python
import torch
import torch.nn as nn


class MixtureOfDepthsBlock(nn.Module):
    def __init__(self, d_model, capacity_factor=0.5):
        super().__init__()
        self.capacity_factor = capacity_factor
        # Scalar score per token decides which tokens get full computation
        self.router = nn.Linear(d_model, 1)
        # The "heavy" block: a standard 4x-expansion feed-forward network
        self.heavy_compute = nn.Sequential(
            nn.Linear(d_model, d_model * 4),
            nn.GELU(),
            nn.Linear(d_model * 4, d_model),
        )

    def forward(self, x):
        # x shape: (batch_size, seq_len, d_model)
        seq_len = x.size(1)
        capacity = int(seq_len * self.capacity_factor)

        # 1. Compute routing scores
        scores = self.router(x).squeeze(-1)  # Shape: (batch_size, seq_len)

        # 2. Identify the top-k tokens to process
        topk_indices = torch.topk(scores, capacity, dim=1).indices

        # 3. Start from the input itself (the residual path for skipped tokens)
        output = x.clone()

        # 4. Apply heavy computation only to the selected tokens
        #    (looped per batch element for clarity; production code would use
        #    batched gather/scatter instead)
        for b in range(x.size(0)):
            selected_tokens = x[b, topk_indices[b]]
            processed_tokens = self.heavy_compute(selected_tokens)
            output[b, topk_indices[b]] += processed_tokens
        return output


# Example usage
dummy_input = torch.randn(2, 64, 128)  # Batch=2, Seq=64, Dim=128
mod_block = MixtureOfDepthsBlock(d_model=128, capacity_factor=0.5)
output = mod_block(dummy_input)
print(f"Output shape: {output.shape}")  # Expect (2, 64, 128)
```
By leveraging frameworks such as PyTorch or TensorFlow, developers can integrate these custom model optimization blocks into their own architectures. Furthermore, tools like the Ultralytics Platform help teams manage the training data needed to train these routers accurately, while integrating with enterprise ecosystems such as Google Cloud AI.