
Continuous Batching

Learn how continuous batching optimizes GPU throughput and reduces latency. Discover how to use Ultralytics YOLO26 to maximize efficiency in production ML tasks.

Continuous batching is a scheduling and inference optimization technique used in machine learning (ML) to maximize hardware utilization and throughput. In traditional static batching, an inference engine waits for a predetermined number of requests to accumulate and then processes them together. This is often inefficient because the system must wait for the longest-running request in the batch to finish before it releases resources, so slots held by requests that finished early sit idle. Continuous batching, also called iteration-level batching (the term dynamic batching is sometimes used as well, though some serving frameworks reserve it for request-level batching), solves this by injecting new requests into the compute batch as soon as an active request completes, which reduces idle time on GPUs and improves overall efficiency.
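The difference between the two scheduling strategies can be illustrated with a toy simulation. The sketch below is not taken from any serving framework: it assumes each request needs a known number of model iterations, that every iteration costs the same regardless of batch occupancy, and the function names `static_batching_steps` and `continuous_batching_steps` are hypothetical.

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    """Total iterations when each batch runs until its longest request ends."""
    total = 0
    for start in range(0, len(lengths), batch_size):
        total += max(lengths[start:start + batch_size])
    return total


def continuous_batching_steps(lengths, batch_size):
    """Total iterations when a freed slot is refilled before the next iteration."""
    waiting = deque(lengths)
    active = []  # remaining iterations for each in-flight request
    total = 0
    while waiting or active:
        # Fill any free slots from the waiting queue.
        while waiting and len(active) < batch_size:
            active.append(waiting.popleft())
        # Run one iteration for every active request and drop finished ones.
        active = [remaining - 1 for remaining in active if remaining > 1]
        total += 1
    return total


# Iterations needed by eight requests (illustrative values only).
lengths = [2, 8, 3, 8, 2, 2, 3, 2]
print(static_batching_steps(lengths, batch_size=4))      # 11
print(continuous_batching_steps(lengths, batch_size=4))  # 8
```

In this toy workload the static scheduler spends 8 iterations on the first batch even though two of its requests finish within 3, while the continuous scheduler reuses those slots immediately. Real engines also have to account for memory limits and the cost of admitting a new request, which this sketch ignores.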

Distinguishing Related Concepts

To better understand how data is processed during model deployment, it is helpful to differentiate continuous batching from other related terms in the glossary:

  • Batch Size: This refers to the fixed number of samples processed simultaneously during training or inference. Traditional batch processing workflows rely on static sizes, whereas continuous batching allows the effective batch size to fluctuate dynamically based on incoming traffic.
  • Real-Time Inference: This concept focuses on minimizing inference latency for immediate predictions, processing single inputs as they arrive. Continuous batching bridges the gap between high-throughput static batching and low-latency real-time inference by maintaining high throughput without forcing fast requests to wait for slower ones.
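To make the idea of a fluctuating effective batch size concrete, the following sketch traces how many requests are in flight at each iteration when requests arrive over time. It is a simplified model under stated assumptions: arrivals are given as hypothetical `(arrival_step, iterations_needed)` pairs sorted by arrival step, and `effective_batch_sizes` is an illustrative helper, not part of any library.

```python
def effective_batch_sizes(arrivals, max_batch):
    """Return the number of active requests at each iteration.

    arrivals: list of (arrival_step, iterations_needed), sorted by arrival_step.
    """
    pending = list(arrivals)
    active = []  # remaining iterations for each in-flight request
    sizes = []
    step = 0
    while pending or active:
        # Admit requests that have already arrived while slots are free.
        while pending and pending[0][0] <= step and len(active) < max_batch:
            active.append(pending.pop(0)[1])
        sizes.append(len(active))
        # Run one iteration and release the slots of finished requests.
        active = [remaining - 1 for remaining in active if remaining > 1]
        step += 1
    return sizes


arrivals = [(0, 3), (0, 1), (2, 2), (5, 1)]
print(effective_batch_sizes(arrivals, max_batch=3))  # [2, 1, 2, 1, 0, 1]
```

The short request that arrives at step 0 leaves after one iteration instead of waiting for its three-iteration neighbor, and the batch size rises and falls with traffic rather than staying fixed.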

Real-World Applications

Continuous batching is critical for production systems that handle high volumes of unpredictable requests. Here are two concrete examples of its application:

  1. High-Throughput Text Generation: When serving Large Language Models (LLMs), generating responses for different users takes varying amounts of time depending on the output length. Frameworks that use continuous batching, such as vLLM on Ray Serve, can return tokens as they are generated and replace a finished sequence with a waiting prompt at the next decoding iteration. This method, originally popularized by research on iteration-level scheduling, can substantially improve text generation throughput.
  2. Asynchronous Video Analytics: In video understanding tasks, such as tracking vehicles across a city's traffic camera network, frames arrive at different intervals. Continuous batching allows object tracking models to process incoming video frames as soon as resources free up, rather than waiting for a full fixed-size batch, which helps keep hardware acceleration pipelines busy for smart city dashboards.
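For single-pass vision models, one simple analogue of the video scenario above is to build each batch from whatever frames are already waiting when the accelerator becomes free. The sketch below illustrates only that idea and is an assumption-laden simplification: `drain_batch` and `fake_model` are hypothetical names, the camera identifiers are made up, and `fake_model` stands in for a real detector or tracker call.

```python
import queue


def drain_batch(frame_queue, max_batch):
    """Collect up to max_batch frames that are already waiting, without blocking."""
    batch = []
    while len(batch) < max_batch:
        try:
            batch.append(frame_queue.get_nowait())
        except queue.Empty:
            break
    return batch


def fake_model(batch):
    # Hypothetical stand-in for a real detector or tracker call.
    return [f"processed {camera_id}:{frame_idx}" for camera_id, frame_idx in batch]


# Frames from three hypothetical cameras, queued in arrival order.
frames = queue.Queue()
for item in [("cam_a", 0), ("cam_b", 0), ("cam_a", 1), ("cam_c", 0), ("cam_b", 1)]:
    frames.put(item)

# Each pass takes whatever is available instead of waiting for a full batch.
while not frames.empty():
    batch = drain_batch(frames, max_batch=4)
    print(len(batch), fake_model(batch))
```

Here the first pass handles four frames and the second handles the single remaining frame immediately. A production pipeline would typically run producers and the worker concurrently and might add a short wait to let a batch fill, which is outside the scope of this sketch.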

Continuous Processing in Vision Tasks

In high-traffic vision deployments, streaming inference can approximate one benefit of continuous batching: results are yielded progressively, so memory is released as each result is consumed rather than held until every input has finished. This is not iteration-level scheduling itself, but it follows the same principle of not blocking on the whole workload. The following Python example demonstrates how to use the generator pattern with the model prediction API to handle a stream of images efficiently.

from ultralytics import YOLO

# Load the latest Ultralytics YOLO26 model
model = YOLO("yolo26n.pt")

# stream=True makes predict() return a generator that yields results
# one at a time, which keeps memory usage low for long input streams
results = model.predict(source=["img1.jpg", "img2.jpg", "img3.jpg"], stream=True)

# Process each result as soon as it completes
for result in results:
    print(f"Detected {len(result.boxes)} objects in this frame.")

Managing system-level resource scheduling requires a balance between raw speed and operational cost. Teams deploying large computer vision (CV) and language models increasingly rely on dedicated serving frameworks to manage these dynamic batches. For enterprise teams looking to streamline their infrastructure, the Ultralytics Platform offers tools for training, monitoring, and exporting models into optimized production environments.
