Discover how real-time inference with Ultralytics YOLO enables instant predictions for AI applications like autonomous driving and security systems.
Real-time inference is the process where a trained machine learning model accepts live input data and generates a prediction almost instantaneously. In this context, "real-time" implies that the processing speed is sufficient to keep up with the flow of incoming data, allowing the system to make immediate decisions. This capability is a cornerstone of modern computer vision applications, enabling devices to perceive and react to their environment with minimal delay.
The primary metric for evaluating real-time performance is inference latency, which measures the time elapsed between the model receiving an input and producing an output. For a system to be considered real-time, this latency must be low enough to meet the specific timing constraints of the use case. For example, a video understanding system analyzing a stream at 30 frames per second (FPS) has roughly 33 milliseconds to process each frame. If the inference takes longer, frames are dropped, and the system lags.
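The per-frame latency budget follows directly from the frame rate. A quick back-of-the-envelope check (plain Python, no external dependencies):

```python
def latency_budget_ms(fps: float) -> float:
    """Maximum time available to process one frame, in milliseconds."""
    return 1000.0 / fps

# A 30 FPS stream leaves ~33 ms per frame; 60 FPS leaves only ~16.7 ms.
print(round(latency_budget_ms(30), 1))  # 33.3
print(round(latency_budget_ms(60), 1))  # 16.7
```

Any model whose end-to-end inference time exceeds this budget will drop frames and fall behind the live stream.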
Achieving this speed often involves utilizing specialized hardware like GPUs or dedicated Edge AI accelerators, such as the NVIDIA Jetson platform. Additionally, engineers often employ model optimization techniques to reduce computational complexity without significantly sacrificing accuracy.
It is important to distinguish real-time workflows from batch inference. While real-time inference processes data points individually as they arrive to minimize latency, batch inference groups data into large chunks to be processed together at a later time.
The ability to generate instant predictions has transformed several industries by automating complex tasks that require split-second decision-making.
To achieve the necessary speeds for real-time applications, developers often deploy models using optimized inference engines. Frameworks like TensorRT for NVIDIA hardware or OpenVINO for Intel processors can significantly accelerate performance. Furthermore, techniques such as model quantization—which reduces the precision of the model's weights from floating-point to integer values—can drastically reduce memory footprint and improve execution speed on embedded systems.
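Quantization maps floating-point weights onto a small integer range via a scale factor. The pure-Python sketch below illustrates the core idea behind symmetric int8 quantization; it is a simplification, and production toolchains such as TensorRT or OpenVINO additionally handle calibration, per-channel scales, and activation quantization:

```python
def quantize_int8(weights):
    """Symmetric int8 quantization: map floats onto [-127, 127] via one scale factor."""
    scale = max(abs(w) for w in weights) / 127.0
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float values from the int8 representation."""
    return [v * scale for v in q]

weights = [0.8, -1.27, 0.02, 0.5]
q, scale = quantize_int8(weights)
recovered = dequantize(q, scale)
# Each recovered value matches the original to within one quantization step,
# while each weight now needs 1 byte instead of 4.
```

Storing 8-bit integers instead of 32-bit floats cuts the memory footprint by roughly 4x, which is why quantization is so effective on memory-constrained embedded systems.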
The following Python example demonstrates how to run real-time inference on a webcam feed using the `ultralytics` library.

```python
from ultralytics import YOLO

# Load the official YOLO11 nano model, optimized for speed
model = YOLO("yolo11n.pt")

# Run inference on the default webcam (source=0)
# 'stream=True' returns a generator for memory-efficient real-time processing
# 'show=True' displays the video feed with prediction overlays
results = model.predict(source=0, stream=True, show=True)

# Iterate over the generator to keep the stream running
for result in results:
    pass
```
As 5G connectivity expands and hardware becomes more powerful, the scope of real-time AI is growing. Internet of Things (IoT) devices are becoming more intelligent, moving from simple data collectors to active decision-makers. Future developments, such as the upcoming YOLO26, aim to push these boundaries further by offering natively end-to-end models that are even smaller and faster, ensuring that smart cities and medical devices can operate seamlessly in real time.