Learn how anchor boxes act as templates for object detection. Explore their role in localization, compare anchor-based vs. anchor-free models like [YOLO26](https://docs.ultralytics.com/models/yolo26/), and discover real-world CV applications.
Anchor boxes are predefined reference rectangles of specific aspect ratios and scales that are placed across an image to assist object detection models in locating and classifying objects. Rather than asking a neural network to predict the exact size and position of an object from scratch—which can be unstable due to the vast variety of object shapes—the model uses these fixed templates as a starting point. By learning to predict how much to adjust, or "regress," these initial boxes to fit the ground truth, the system can achieve faster convergence and higher accuracy. This technique fundamentally transformed the field of computer vision (CV) by simplifying the complex task of localization into a more manageable optimization problem.
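To make the regression idea concrete, here is a minimal sketch that applies predicted offsets to a single anchor using the standard box-delta parameterization popularized by Faster R-CNN; the function name and the example values are illustrative, not taken from any specific library.

```python
import numpy as np


def decode_deltas(anchor, deltas):
    """Apply predicted offsets (tx, ty, tw, th) to an anchor box.

    Center shifts are scaled by the anchor size, and width/height
    adjustments are predicted in log-space so the decoded box
    dimensions always stay positive.
    """
    xa, ya, wa, ha = anchor  # anchor center (xa, ya), width wa, height ha
    tx, ty, tw, th = deltas  # network-predicted adjustments

    x = xa + tx * wa  # shift the center proportionally to anchor size
    y = ya + ty * ha
    w = wa * np.exp(tw)  # exponentiate the log-space size adjustment
    h = ha * np.exp(th)
    return x, y, w, h


# A small predicted adjustment turns the fixed template into a tight box.
print(decode_deltas(anchor=(100.0, 100.0, 50.0, 80.0), deltas=(0.1, -0.05, 0.2, 0.0)))
```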
In classical anchor-based detectors, the input image is divided into a grid of cells. At each cell location, the network generates multiple anchor boxes with different geometries. For instance, to simultaneously detect a tall pedestrian and a wide car, the model might propose a tall, narrow box and a short, wide box at the same center point.
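The sketch below illustrates this tiling process, assuming a 640x640 input, a single detection stride, and a handful of scales and aspect ratios; real detectors typically tile anchors at several strides, and these parameter values are illustrative only.

```python
import itertools


def generate_anchors(image_size=640, stride=32, scales=(64, 128), ratios=(0.5, 1.0, 2.0)):
    """Place anchors of several scales and aspect ratios at every grid cell center."""
    anchors = []
    centers = range(stride // 2, image_size, stride)  # one center per grid cell
    for cy, cx in itertools.product(centers, centers):
        for scale, ratio in itertools.product(scales, ratios):
            w = scale * ratio**0.5  # ratio = w / h, so a wide box has ratio > 1
            h = scale / ratio**0.5
            anchors.append((cx, cy, w, h))
    return anchors


anchors = generate_anchors()
print(len(anchors))  # 20x20 cells x 2 scales x 3 ratios = 2400 anchors
```

Even this single-stride example produces thousands of candidate boxes, which is why anchor assignment and background filtering matter so much downstream.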
During model training, these anchors are matched against ground-truth objects using a metric called Intersection over Union (IoU). Anchors that overlap significantly with a labeled object are designated as "positive" samples. The network then learns two parallel tasks:

- **Classification:** predicting the object class (or background) for each anchor.
- **Regression:** predicting the offsets needed to shift and resize the anchor so it fits the matched ground-truth box.
This approach allows the model to handle multiple objects of different sizes located near each other, as each object can be assigned to the anchor that best matches its shape.
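The following sketch shows this IoU-based matching in its simplest form; the 0.5 threshold and the example boxes are illustrative, and production detectors use more elaborate assignment rules.

```python
def iou(box_a, box_b):
    """Intersection over Union for boxes given as (x1, y1, x2, y2)."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)


# Anchors overlapping a ground-truth box above a threshold become positive samples.
ground_truth = (50, 50, 150, 150)
candidate_anchors = [(40, 40, 160, 160), (300, 300, 400, 400)]
labels = ["positive" if iou(a, ground_truth) >= 0.5 else "negative" for a in candidate_anchors]
print(labels)  # ['positive', 'negative']
```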
Although newer architectures are moving toward anchor-free designs, anchor boxes remain vital in many established production systems where object characteristics are predictable.
It is important to distinguish between traditional anchor-based methods and modern anchor-free detectors. Anchor-based models predict offsets relative to predefined template boxes, whereas anchor-free models such as YOLO26 predict object locations directly, typically as center points with distances to the box edges, which removes the need to tune anchor shapes and sizes per dataset.
While modern high-level APIs like the Ultralytics Platform abstract away these details during training, understanding anchors is useful when working with older model architectures or analyzing model config files. The following snippet demonstrates how to load a model and inspect its configuration, where anchor settings (if present) would typically be defined.
```python
from ultralytics import YOLO

# Load a pre-trained YOLO model (YOLO26 is anchor-free, but legacy configs act similarly)
model = YOLO("yolo26n.pt")

# Inspect the model's stride, which relates to grid cell sizing in detection
print(f"Model strides: {model.model.stride}")

# For older anchor-based models, anchors might be stored in the model's attributes
# Modern anchor-free models calculate targets dynamically without fixed boxes
if hasattr(model.model, "anchors"):
    print(f"Anchors: {model.model.anchors}")
else:
    print("This model architecture is anchor-free.")
```
While effective, anchor boxes introduce complexity. The vast number of anchors generated—often tens of thousands per image—creates a class imbalance problem, as most anchors cover only the background. Techniques like Focal Loss are used to mitigate this by down-weighting easy background examples. Additionally, the final output usually requires Non-Maximum Suppression (NMS) to filter out redundant overlapping boxes, ensuring that only the most confident detection for each object remains.
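To illustrate how Focal Loss addresses this imbalance, here is a minimal sketch for a single binary prediction, following Lin et al. (2017); the alpha and gamma values are the defaults reported in that paper.

```python
import math


def focal_loss(p, y, alpha=0.25, gamma=2.0):
    """Focal loss for one binary prediction.

    p: predicted probability of the positive class; y: ground-truth label (0 or 1).
    The (1 - p_t)**gamma factor shrinks the loss on easy, confident examples,
    so the thousands of easy background anchors no longer dominate training.
    """
    p_t = p if y == 1 else 1 - p
    alpha_t = alpha if y == 1 else 1 - alpha
    return -alpha_t * (1 - p_t) ** gamma * math.log(p_t)


# An easy background anchor contributes far less loss than a hard one.
print(f"easy negative (p=0.1): {focal_loss(0.1, 0):.5f}")
print(f"hard negative (p=0.9): {focal_loss(0.9, 0):.5f}")
```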