Scaling Laws
Explore neural scaling laws and test-time compute in AI. Learn how resource scaling and optimization guide models like the new Ultralytics YOLO26.
Neural scaling laws are empirical observations that a model's performance improves predictably as specific resources (compute, dataset size, and parameter count) are increased. Popularized by research from organizations like OpenAI and Google DeepMind, these power-law relationships show that scaling up resources yields predictable reductions in cross-entropy loss. Understanding these principles helps researchers and engineers allocate multi-million-dollar budgets efficiently, estimating how large a neural network needs to be to reach a target performance level before starting a massive training run.
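As a minimal sketch of what "predictable" means here, the snippet below fits a pure power law, loss ≈ a · size^(−b), by linear regression in log-log space and then extrapolates to a larger model. The data points are synthetic and generated from a known power law purely for illustration, and the function names are hypothetical. Published scaling-law fits usually also include an irreducible loss term, which this sketch omits.

```python
import numpy as np


def fit_power_law(sizes, losses):
    """Fit loss ≈ a * size**(-b) via a straight-line fit in log-log space."""
    slope, intercept = np.polyfit(np.log(sizes), np.log(losses), deg=1)
    return float(np.exp(intercept)), float(-slope)


def predict_loss(size, a, b):
    """Evaluate the fitted power law at a new size."""
    return a * size ** (-b)


# Synthetic measurements drawn from a known power law (illustrative, not real data)
sizes = np.array([1e6, 1e7, 1e8, 1e9])
losses = 10.0 * sizes ** (-0.1)

a, b = fit_power_law(sizes, losses)
print(f"fitted a={a:.3f}, b={b:.3f}")
print(f"extrapolated loss at 1e10 parameters: {predict_loss(1e10, a, b):.3f}")
```

In practice the fit would be made on a handful of small training runs, and the extrapolation is only as trustworthy as the assumption that the same power law keeps holding at larger scale.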
The Evolution of Pre-Training Scaling
The original formulation of these rules, known as the Kaplan scaling laws and introduced in 2020, established that language model loss falls smoothly as a power law in model size, dataset size, and training compute. This framework was refined in 2022 by the Chinchilla scaling laws, which showed that for compute-optimal training, model size and training data should be scaled in roughly equal proportion. For instance, if you double a model's parameters, you should also double the number of training tokens. This paradigm guided the development of modern Large Language Models (LLMs) built using frameworks like PyTorch and TensorFlow, helping ensure that massive GPU clusters are used effectively rather than wasted on oversized, undertrained models.
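The equal-proportion rule can be sketched with two widely quoted approximations, both of which are assumptions here rather than figures from this article: training compute C ≈ 6 · N · D FLOPs for N parameters and D tokens, and a rule of thumb of about 20 training tokens per parameter. Under those assumptions, a compute budget determines both N and D.

```python
import math

FLOPS_PER_PARAM_TOKEN = 6  # assumption: training compute C ≈ 6 * N * D
TOKENS_PER_PARAM = 20      # assumption: commonly quoted Chinchilla-style rule of thumb


def compute_optimal_allocation(compute_flops):
    """Split a training budget into parameters N and tokens D with D = 20 * N."""
    n_params = math.sqrt(compute_flops / (FLOPS_PER_PARAM_TOKEN * TOKENS_PER_PARAM))
    n_tokens = TOKENS_PER_PARAM * n_params
    return n_params, n_tokens


for budget in (1e21, 4e21):
    n, d = compute_optimal_allocation(budget)
    print(f"C={budget:.0e} FLOPs -> N≈{n:.2e} params, D≈{d:.2e} tokens")
```

Because both N and D grow with the square root of C in this sketch, quadrupling the budget doubles the parameter count and the token count together.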
The Paradigm Shift: Test-Time Compute Scaling
Between 2024 and 2025, as highlighted in annual AI progress reports, the AI industry experienced a major shift toward inference-time scaling. As pre-training ever-larger models began hitting diminishing returns and data availability walls, researchers found ways to scale LLM test-time compute directly. Giving a model more compute during inference can substantially improve its complex reasoning capabilities.
Techniques like Chain-of-Thought (CoT) and Best-of-N sampling allow models to explore multiple reasoning paths before answering. This test-time scaling behavior, demonstrated by reasoning models such as OpenAI's o1 and DeepSeek-R1, suggests that increasing prediction-phase compute can allow a much smaller, more efficient architecture to outperform a much larger legacy model on strict logical benchmarks.
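Best-of-N sampling can be illustrated with a toy example that involves no real model. In the sketch below, `generate_candidate` and `score` are hypothetical stand-ins for sampling an answer and for a verifier or reward model. The verifier here is idealized (it knows the correct answer), which real verifiers do not.

```python
import random


def generate_candidate(rng):
    """Hypothetical stand-in for sampling one answer: a noisy guess around 0."""
    return rng.gauss(0.0, 1.0)


def score(candidate, target=0.0):
    """Hypothetical, idealized verifier: higher is better, best at the target."""
    return -abs(candidate - target)


def best_of_n(n, rng):
    """Sample n candidates and keep the one the verifier scores highest."""
    candidates = [generate_candidate(rng) for _ in range(n)]
    return max(candidates, key=score)


def mean_error(n, trials=2000, seed=0):
    """Average distance from the target after Best-of-N selection."""
    rng = random.Random(seed)
    return sum(abs(best_of_n(n, rng)) for _ in range(trials)) / trials


for n in (1, 4, 16):
    print(f"N={n:>2}: mean error {mean_error(n):.3f}")
```

The generator never changes, yet the selected answer gets closer to the target as N grows. The extra quality is bought entirely with inference compute, and how much is gained depends on how reliable the verifier is.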
Real-World Applications
Scaling principles govern development far beyond text generation, and they strongly shape modern computer vision and object detection pipelines.
- Resource Allocation for Foundation Models: Companies developing autonomous driving systems rely on scaling formulas to estimate how many annotated images are required to raise mean Average Precision (mAP) to safe, production-ready levels. By utilizing the Ultralytics Platform for collaborative data annotation and cloud-based distributed training, teams can project their costs before deployment.
- Model Sizing and Edge Deployment: Scaling formulas directly influence the architectural design of modern models like Ultralytics YOLO26. By offering a unified family of models mathematically scaled from Nano (n) to Extra Large (x), developers can predictably trade off strict accuracy requirements against inference latency based on their specific edge hardware constraints.
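The data-budgeting idea in the first bullet can be sketched by inverting a power law. The snippet assumes an error metric (for example 1 − mAP) that follows error ≈ a · size^(−b) in the number of annotated images. The coefficients below are hypothetical placeholders, not measurements, and `required_dataset_size` is an illustrative helper rather than part of any library.

```python
def required_dataset_size(target_error, a, b):
    """Invert error = a * size**(-b) to find the size that reaches target_error."""
    if target_error <= 0 or a <= 0 or b <= 0:
        raise ValueError("target_error, a, and b must be positive")
    return (a / target_error) ** (1.0 / b)


# Hypothetical coefficients, as if fitted on earlier training runs (not real data)
a, b = 2.0, 0.25
current_size = 10_000
current_error = a * current_size ** (-b)

target = current_error / 2  # goal: halve the current error
needed = required_dataset_size(target, a, b)
print(f"error at {current_size:,} images: {current_error:.3f}")
print(f"images needed for error {target:.3f}: {needed:,.0f}")
```

With an exponent of 0.25, halving the error requires 2^(1/0.25) = 16 times as much data, which is why a small fitted exponent translates into steep annotation costs.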
Code Example: Inference-Time Scaling in Computer Vision
In computer vision, you can leverage a practical form of test-time scaling called Test-Time Augmentation (TTA). By spending additional compute during the prediction phase to evaluate multiple augmented versions of an image, the model can often improve its detection accuracy at the cost of slower inference, mirroring the reasoning search techniques seen in advanced LLMs.
from ultralytics import YOLO
# Load the recommended YOLO26 model (nano version for high speed)
model = YOLO("yolo26n.pt")
# Perform standard inference (faster, lower test-time compute)
results_standard = model("https://ultralytics.com/images/bus.jpg")
# Perform inference-time scaling via Test-Time Augmentation (TTA)
# Can improve accuracy by spending more compute during prediction
results_tta = model("https://ultralytics.com/images/bus.jpg", augment=True)
print(f"Standard detections: {len(results_standard[0].boxes)}")
print(f"Scaled TTA detections: {len(results_tta[0].boxes)}")
Scaling Laws vs. Related Concepts
While closely tied to hardware capabilities, AI scaling laws describe how model performance responds to the resources that hardware provides, not how the hardware itself improves.
- Scaling Laws vs. Moore's Law: Moore's Law is a long-standing hardware observation predicting that the number of transistors on a microchip roughly doubles every two years. In contrast, AI scaling mathematically tracks how actual model capability improves given access to that expanding hardware pool.
- Training Scaling vs. Inference Scaling: Training formulas calculate the most compute-optimal mix of parameters and data during the initial creation of a model. Inference scaling, conversely, measures how dynamically spending extra compute on search and verification steps immediately prior to generating an output improves the final result without requiring any retraining.






