Explore Multi-Object Tracking (MOT) in computer vision. Learn how to detect and track entities using Ultralytics YOLO26 for autonomous driving, retail, and more.
Multi-Object Tracking (MOT) is a dynamic task in computer vision (CV) that involves detecting multiple distinct entities within a video stream and maintaining their identities over time. Unlike standard object detection, which treats every frame as an isolated snapshot, MOT introduces a temporal dimension to artificial intelligence (AI). By assigning a unique identification number (ID) to each detected instance—such as a specific pedestrian in a crowd or a vehicle on a highway—MOT algorithms allow systems to trace trajectories, analyze behavior, and understand interactions. This capability is fundamental to modern video understanding and enables machines to perceive continuity in a changing environment.
Most contemporary tracking systems operate on a "tracking-by-detection" paradigm. This approach separates the process into two main stages: identifying what is in the frame and then associating those findings with known objects from the past.
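The detect-then-associate loop can be illustrated with a minimal sketch. The greedy IoU matcher below is a hypothetical, simplified stand-in for the data-association stage used by production trackers (which typically add motion models and appearance features); boxes are plain `(x1, y1, x2, y2)` tuples.

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0


def associate(tracks, detections, iou_threshold=0.3):
    """Greedily match known track boxes to new detections by IoU.

    Returns {track_id: detection_index}; in a full tracker, unmatched
    detections would start new tracks and unmatched tracks would age out.
    """
    pairs = sorted(
        ((iou(box, det), tid, di)
         for tid, box in tracks.items()
         for di, det in enumerate(detections)),
        reverse=True,
    )
    matches, used_tracks, used_dets = {}, set(), set()
    for score, tid, di in pairs:
        if score < iou_threshold:
            break  # remaining pairs overlap too little to match
        if tid not in used_tracks and di not in used_dets:
            matches[tid] = di
            used_tracks.add(tid)
            used_dets.add(di)
    return matches
```

Because each detection is claimed by at most one track, identities carry forward from frame to frame even when detections arrive in a different order.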
While the terminology is similar, Multi-Object Tracking (MOT) differs significantly from Single Object Tracking (SOT). SOT focuses on following one specific target initialized in the first frame, often ignoring all other entities. In contrast, MOT must handle an unknown and varying number of targets that may enter or leave the scene at any time. This makes MOT computationally more demanding, as it requires robust logic to handle track initiation, termination, and the complex interactions between multiple moving bodies.
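The track initiation and termination logic mentioned above can be sketched as a small bookkeeping class. This is a hypothetical illustration, not any specific tracker's implementation: a tentative track is confirmed after `min_hits` matches and dropped after `max_misses` consecutive frames without one.

```python
class TrackManager:
    """Minimal track lifecycle bookkeeping (illustrative only)."""

    def __init__(self, min_hits=3, max_misses=5):
        self.min_hits = min_hits
        self.max_misses = max_misses
        self.next_id = 1
        self.tracks = {}  # id -> {"hits": int, "misses": int}

    def start(self):
        """Initiate a tentative track for an unmatched detection."""
        tid = self.next_id
        self.next_id += 1
        self.tracks[tid] = {"hits": 1, "misses": 0}
        return tid

    def update(self, matched_ids):
        """Advance one frame: reward matched tracks, age the rest."""
        for tid, state in list(self.tracks.items()):
            if tid in matched_ids:
                state["hits"] += 1
                state["misses"] = 0
            else:
                state["misses"] += 1
                if state["misses"] > self.max_misses:
                    del self.tracks[tid]  # terminate a lost track

    def confirmed(self):
        """IDs that have survived the initiation threshold."""
        return [tid for tid, s in self.tracks.items()
                if s["hits"] >= self.min_hits]
```

Because targets may enter or leave at any time, this per-frame aging is what lets an MOT system handle a varying number of objects without manual initialization.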
The ability to track multiple entities simultaneously drives innovation across several major industries.
Ultralytics makes it straightforward to implement tracking with state-of-the-art models. The `track()` method integrates detection and tracking logic seamlessly, supporting algorithms like ByteTrack and BoT-SORT. The example below demonstrates tracking vehicles in a video using the recommended YOLO26 model.
```python
from ultralytics import YOLO

# Load the official YOLO26 small model
model = YOLO("yolo26s.pt")

# Track objects in a video file (or use source=0 for a webcam)
# The 'persist=True' argument keeps track IDs consistent between frames
results = model.track(source="traffic_analysis.mp4", show=True, persist=True)

# Print the IDs of objects tracked in the first frame
if results[0].boxes.id is not None:
    print(f"Tracked IDs: {results[0].boxes.id.int().tolist()}")
```
Despite advancements, MOT remains a challenging field. Occlusion is a primary difficulty; when objects cross paths or hide behind obstacles, maintaining identity is complex. Crowded scenes, such as a busy marathon or a flock of birds, test the limits of data association algorithms. Furthermore, maintaining real-time inference speeds while processing high-resolution video streams requires efficient model architectures and often specialized hardware like NVIDIA Jetson devices.
To address these challenges, researchers are exploring end-to-end deep learning approaches that unify detection and tracking into a single network, as well as leveraging the Ultralytics Platform to annotate challenging datasets and train robust custom models.