Discover how real-time inference with Ultralytics YOLO delivers instant predictions for AI applications such as autonomous driving and security systems.
Real-time inference refers to the process where a trained machine learning (ML) model accepts live input data and generates predictions almost instantaneously. Unlike offline processing, where data is collected and analyzed in bulk at a later time, real-time inference occurs on the fly, enabling systems to react to their environment with speed and agility. This capability is the heartbeat of modern Artificial Intelligence (AI) applications, allowing devices to perceive, interpret, and act upon data within milliseconds.
The primary metric for evaluating real-time performance is inference latency. This measures the time delay between the moment data is input into the model—such as a frame from a video camera—and the moment the model produces an output, such as a bounding box or classification label. For an application to be considered "real-time," the latency must be low enough to match the speed of the incoming data stream.
For example, in video understanding tasks running at 30 frames per second (FPS), the system has a strict time budget of approximately 33 milliseconds to process each frame. If inference takes longer, the system introduces lag, potentially leading to dropped frames or delayed responses. Achieving this often requires hardware acceleration using GPUs or specialized Edge AI devices such as the NVIDIA Jetson.
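As a rough illustration, the sketch below times a single prediction and compares it against the 33 ms budget of a 30 FPS stream. It is a minimal example that assumes the ultralytics package is installed; the model weights file and the placeholder test image are assumptions, not part of the original snippet.

from time import perf_counter

from ultralytics import YOLO

FPS = 30
FRAME_BUDGET_MS = 1000 / FPS  # ~33.3 ms per frame at 30 FPS

model = YOLO("yolo26n.pt")  # assumed model file; any detection model works
model.predict("bus.jpg", verbose=False)  # warm-up run to exclude one-time setup cost

# Time a single inference on a placeholder image ("bus.jpg" is an assumption)
start = perf_counter()
model.predict("bus.jpg", verbose=False)
latency_ms = (perf_counter() - start) * 1000

print(f"Inference latency: {latency_ms:.1f} ms (budget: {FRAME_BUDGET_MS:.1f} ms)")
if latency_ms > FRAME_BUDGET_MS:
    print("Too slow for 30 FPS; consider a smaller model or hardware acceleration.")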
It is helpful to distinguish real-time workflows from batch processing. While both involve generating predictions, their goals and architectures differ significantly: real-time inference prioritizes low latency on individual inputs arriving from a live stream, whereas batch processing prioritizes throughput, collecting data and analyzing it in bulk at a later time when delays of minutes or hours are acceptable.
The ability to make split-second decisions has transformed various industries by enabling automation in dynamic environments.
Deploying models for real-time applications often requires optimization to ensure they run efficiently on target hardware. Techniques such as model quantization reduce the precision of the model's weights (e.g., from float32 to int8) to decrease memory usage and increase inference speed with minimal impact on accuracy.
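The arithmetic behind int8 quantization can be sketched in a few lines of NumPy. This is purely illustrative; production toolchains such as TensorRT, OpenVINO, or TFLite handle calibration, per-channel scales, and activation quantization for you.

import numpy as np

# Illustrative float32 "weights" (in practice these come from a trained model)
weights = np.random.randn(1000).astype(np.float32)

# Symmetric quantization: map the float range onto the int8 range [-127, 127]
scale = np.abs(weights).max() / 127.0
q_weights = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)

# Dequantize to approximate the original values at inference time
deq_weights = q_weights.astype(np.float32) * scale

print(f"Storage: {weights.nbytes} bytes (float32) vs {q_weights.nbytes} bytes (int8)")
print(f"Max absolute error: {np.abs(weights - deq_weights).max():.5f}")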
Developers can utilize the Ultralytics Platform to streamline this process. The platform simplifies training and allows users to export models to optimized formats like TensorRT for NVIDIA GPUs, OpenVINO for Intel CPUs, or TFLite for mobile deployment.
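For instance, a trained model can be exported to each of these targets with the ultralytics API, as sketched below. The format names follow the library's documented export options; the TensorRT and TFLite exports assume the corresponding toolchains (CUDA/TensorRT, TensorFlow) are installed on the machine.

from ultralytics import YOLO

model = YOLO("yolo26n.pt")

# Export to TensorRT for NVIDIA GPUs (requires CUDA and TensorRT)
model.export(format="engine", half=True)

# Export to OpenVINO for Intel CPUs
model.export(format="openvino")

# Export to TFLite with int8 quantization for mobile deployment
model.export(format="tflite", int8=True)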
The following Python snippet demonstrates how to run real-time inference on a webcam feed using the ultralytics library. It uses the YOLO26 Nano model, which is engineered specifically for high-speed performance on edge devices.
from ultralytics import YOLO
# Load the YOLO26 Nano model, optimized for speed and real-time tasks
model = YOLO("yolo26n.pt")
# Run inference on the default webcam (source="0")
# 'stream=True' returns a generator for memory-efficient processing
# 'show=True' displays the video feed with bounding boxes in real-time
results = model.predict(source="0", stream=True, show=True)
# Iterate through the generator to process frames as they arrive
for result in results:
    # Example: Print the number of objects detected in the current frame
    print(f"Detected {len(result.boxes)} objects")
