Continuous Batching

Learn how continuous batching optimizes GPU throughput and reduces latency. Discover how to use Ultralytics YOLO26 to maximize efficiency in production ML tasks.

Continuous batching is a scheduling and inference optimization technique used in machine learning (ML) to maximize hardware utilization and throughput. In traditional static batching, an inference engine waits for a predetermined number of requests to accumulate and then processes them together. This is often inefficient because every slot in the batch stays occupied until the longest-running request finishes, so short requests hold resources they no longer need. Continuous batching, also called iteration-level batching and sometimes loosely referred to as dynamic batching, addresses this by admitting new requests into the running batch as soon as an active request completes, which reduces idle time on GPUs and improves overall efficiency.
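The difference between the two strategies can be illustrated with a toy simulation. This is a minimal sketch, not a real inference engine: it assumes every request needs a known number of equal-cost iterations, and the function names and request lengths are hypothetical values chosen for illustration.

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    """Total iterations when each batch runs until its longest request ends."""
    steps = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        steps += max(batch)  # every slot is held until the slowest request finishes
    return steps


def continuous_batching_steps(lengths, batch_size):
    """Total iterations when a freed slot is refilled before the next iteration."""
    waiting = deque(lengths)
    running = []  # remaining iterations for each active request
    steps = 0
    while waiting or running:
        # Admit waiting requests into any free slots before this iteration.
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        # One iteration advances every active request; finished ones leave the batch.
        running = [r - 1 for r in running if r > 1]
        steps += 1
    return steps


lengths = [2, 8, 3, 8, 2, 2, 3, 4]  # hypothetical iterations needed per request
static_steps = static_batching_steps(lengths, batch_size=4)
continuous_steps = continuous_batching_steps(lengths, batch_size=4)
print(static_steps, continuous_steps)
```

With these example lengths the continuous schedule finishes in fewer iterations, because short requests give up their slots to waiting requests instead of idling until the longest request in their batch is done.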

Distinguishing Related Concepts

To better understand how data is processed during model deployment, it is helpful to differentiate continuous batching from other related terms in the glossary:

Batch Size: This refers to the fixed number of samples processed simultaneously during training or inference. Traditional batch processing workflows rely on static sizes, whereas continuous batching allows the effective batch size to fluctuate dynamically based on incoming traffic.
Real-Time Inference: This concept focuses on minimizing inference latency for immediate predictions, processing single inputs as they arrive. Continuous batching bridges the gap between high-throughput static batching and low-latency real-time inference by maintaining high throughput without forcing fast requests to wait for slower ones.

Real-World Applications

Continuous batching is critical for production systems that handle high volumes of unpredictable requests. Here are two concrete examples of its application:

High-Throughput Text Generation: When serving Large Language Models (LLMs), generating responses for different users takes varying amounts of time depending on the output length. Frameworks that use continuous batching, such as vLLM on Ray Serve, can stream newly generated tokens and replace a finished sequence with a waiting prompt at the next decoding step. This method, originally popularized by research on iteration-level scheduling, substantially improves text generation throughput.
Asynchronous Video Analytics: In video understanding tasks, such as tracking vehicles across a city's traffic camera network, frames arrive at different intervals. Continuous batching allows object tracking models to process incoming video frames as soon as resources free up, rather than waiting for a full batch, which helps keep hardware acceleration pipelines for smart city dashboards busy.
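The variable effective batch size described in the video example can be sketched with a standard-library queue. This is an illustrative assumption about how a frame collector might work, not the scheduler of any particular framework: `next_batch`, `max_batch`, and the string frame placeholders are hypothetical names introduced here.

```python
import queue


def next_batch(frame_queue, max_batch, timeout=0.01):
    """Collect up to max_batch frames that are ready now, without waiting for a full batch."""
    batch = []
    try:
        # Block briefly for the first frame so an idle loop does not spin.
        batch.append(frame_queue.get(timeout=timeout))
        # Then take whatever else has already arrived, up to the limit.
        while len(batch) < max_batch:
            batch.append(frame_queue.get_nowait())
    except queue.Empty:
        pass
    return batch


frames = queue.Queue()
for frame_id in range(5):  # stand-ins for frames arriving from several cameras
    frames.put(f"frame-{frame_id}")

batch_sizes = []
while True:
    batch = next_batch(frames, max_batch=4)
    if not batch:
        break  # nothing arrived within the timeout
    batch_sizes.append(len(batch))
print(batch_sizes)
```

The batch size follows the traffic: a full batch is formed when enough frames are waiting, and a smaller one is processed immediately when they are not.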

Continuous Processing in Vision Tasks

In high-traffic deployments, streaming inference results iteratively can approximate one benefit of continuous batching: results are yielded one at a time, so memory is freed progressively rather than held until every input has been processed. Request-level scheduling itself is still handled by the serving framework. The following Python example uses the generator pattern with the model prediction API to handle a stream of images efficiently.

from ultralytics import YOLO

# Load a pretrained Ultralytics YOLO26 nano model
model = YOLO("yolo26n.pt")

# With stream=True, predict() returns a generator that yields one Results
# object per input instead of building the full list in memory
results = model.predict(source=["img1.jpg", "img2.jpg", "img3.jpg"], stream=True)

# Process each result as soon as it completes
for result in results:
    print(f"Detected {len(result.boxes)} objects in this frame.")
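The lazy behaviour of the generator pattern can be shown without a model. The sketch below is a plain-Python analogy and makes no claim about how the library implements `stream=True` internally; `lazy_results` and `fake_infer` are hypothetical helpers used only to show that work happens when a result is requested.

```python
def lazy_results(sources, infer):
    """Yield one result at a time instead of building a full list up front."""
    for source in sources:
        yield infer(source)


calls = []


def fake_infer(source):
    # Hypothetical stand-in for a model call; records when work actually happens.
    calls.append(source)
    return f"result for {source}"


stream = lazy_results(["img1.jpg", "img2.jpg", "img3.jpg"], fake_infer)
calls_before = len(calls)       # creating the generator runs no inference
first = next(stream)            # requesting a result processes exactly one input
calls_after_first = len(calls)
print(calls_before, calls_after_first, first)
```

Because each result is produced on demand, the consumer can discard it before the next one is computed, which is what keeps peak memory low for long input streams.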

Managing system-level resource scheduling requires a balance between raw speed and operational cost. Teams deploying massive computer vision (CV) and language models increasingly rely on advanced serving frameworks to manage these dynamic batches. For enterprise teams looking to streamline their infrastructure, the Ultralytics Platform offers robust tools for training, monitoring, and exporting models into highly optimized production environments.
