Optimizing Ultralytics YOLO models with the TensorRT integration

May 20, 2025
Learn how to export Ultralytics YOLO models using the TensorRT integration for faster, more efficient AI performance on NVIDIA GPUs for real-time applications.

Consider a self-driving car moving through a busy street with only milliseconds to detect a pedestrian stepping off the curb. At the same time, it might need to recognize a stop sign partially hidden by a tree or react quickly to a nearby vehicle swerving into its lane. In such situations, speed and real-time responses are critical.
This is where artificial intelligence (AI), specifically computer vision, a branch of AI that helps machines interpret visual data, plays a key role. For computer vision solutions to work reliably in real-world environments, they often need to process information quickly, handle multiple tasks at once, and use memory efficiently.
One way to achieve this is through hardware acceleration, using specialized devices like graphics processing units (GPUs) to run models faster. NVIDIA GPUs are especially well known for such tasks, thanks to their ability to deliver low latency and high throughput.
However, running a model on a GPU as-is doesn’t always guarantee optimal performance. Vision AI models typically require optimization to fully leverage the capabilities of the hardware they run on: to achieve full performance on a given device, the model needs to be compiled to use that device’s specific set of instructions.
For example, TensorRT is an export format and optimization library developed by NVIDIA to enhance inference performance on NVIDIA GPUs. It uses advanced techniques to significantly reduce inference time while maintaining accuracy.
In this article, we’ll explore the TensorRT integration supported by Ultralytics and walk through how you can export your YOLO11 model for faster, more efficient deployment on NVIDIA hardware. Let’s get started!
TensorRT is a toolkit developed by NVIDIA to help AI models run faster and more efficiently on NVIDIA GPUs. It’s designed for real-world applications where speed and performance really matter, like self-driving cars and quality control in manufacturing and pharmaceuticals.
TensorRT includes tools like compilers and model optimizers that can work behind the scenes to make sure your models run with low latency and can handle a higher throughput.
The TensorRT integration supported by Ultralytics works by optimizing your YOLO model to run more efficiently on GPUs using methods like reducing precision. This refers to using lower-bit formats, such as 16-bit floating-point (FP16) or 8-bit integer (INT8), to represent model data, which reduces memory usage and speeds up computation with minimal impact on accuracy.
Also, compatible neural network layers are fused in optimized TensorRT models to reduce memory usage, resulting in faster and more efficient inference.
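As a concrete illustration, here’s a minimal sketch of how these precision modes can be requested through the Ultralytics export API. The coco8.yaml file is a small sample dataset standing in for whatever calibration data you would use for INT8:

```python
from ultralytics import YOLO

# Load a pre-trained YOLO11 nano model
model = YOLO("yolo11n.pt")

# Export a TensorRT engine with FP16 (half-precision) enabled
model.export(format="engine", half=True)

# Alternatively, export with INT8 quantization; INT8 calibration needs
# a representative dataset, passed via the `data` argument
model.export(format="engine", int8=True, data="coco8.yaml")
```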
Before we discuss how you can export YOLO11 using the TensorRT integration, let’s take a look at some key features of the TensorRT model format:

- Precision calibration: Support for reduced-precision formats like FP16 and INT8 trades a small amount of numerical precision for substantial gains in speed and memory efficiency.
- Layer and tensor fusion: Compatible operations are merged into single GPU kernels, cutting memory traffic and per-layer overhead.
- Kernel auto-tuning: TensorRT benchmarks and selects the fastest available GPU kernels for the specific device the engine is built on.
- Dynamic tensor memory: Memory is allocated to tensors only for as long as they are needed, reducing the overall footprint during inference.
Exporting Ultralytics YOLO models like Ultralytics YOLO11 to the TensorRT model format is easy. Let’s walk through the steps involved.
To get started, you can install the Ultralytics Python package using a package manager like pip. This can be done by running the command “pip install ultralytics” in your command prompt or terminal.
After successfully installing the Ultralytics Python package, you can train, test, fine-tune, export, and deploy models for various computer vision tasks, such as object detection, classification, and instance segmentation. If you encounter any difficulties while installing the package, you can refer to the Common Issues guide for solutions and tips.
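To double-check that the installation worked, you can run the package’s built-in environment check, which prints your Ultralytics, Python, PyTorch, and CUDA/GPU details:

```python
import ultralytics

# Prints version and environment information, including whether a
# CUDA-capable GPU was detected. Handy to run before exporting to TensorRT.
ultralytics.checks()
```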
For the next step, you’ll need an NVIDIA device. Use the code snippet below to load and export YOLO11 to the TensorRT model format. It loads a pre-trained nano variant of the YOLO11 model (yolo11n.pt) and exports it as a TensorRT engine file (yolo11n.engine), making it ready for deployment across NVIDIA devices.
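```python
from ultralytics import YOLO

# Load the pre-trained YOLO11 nano model
model = YOLO("yolo11n.pt")

# Export to a TensorRT engine; this creates yolo11n.engine in the
# working directory (run the export on the GPU you plan to deploy on)
model.export(format="engine")
```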
After converting your model to TensorRT format, you can deploy it for various applications.
The example below shows how to load the exported YOLO11 model (yolo11n.engine) and run inference with it. Inference involves using the trained model to make predictions on new data. In this case, we’ll use an input image of a dog to test the model.
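Here, “dog.jpg” is a placeholder for whatever test image you have on hand:

```python
from ultralytics import YOLO

# Load the exported TensorRT engine
tensorrt_model = YOLO("yolo11n.engine")

# Run inference; "dog.jpg" is a placeholder for your own test image,
# and save=True writes the annotated result to disk
results = tensorrt_model("dog.jpg", save=True)
```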
When you run this code, the annotated output image will be saved in the runs/detect/predict folder.
The Ultralytics Python package supports various integrations that allow exporting YOLO models to different formats like TorchScript, CoreML, ONNX, and TensorRT. So, when should you choose to use the TensorRT integration?
Here are a few factors that set the TensorRT model format apart from other export integration options:

- Hardware-specific optimization: TensorRT engines are compiled and tuned for the exact GPU they are built on, extracting performance that more general formats like ONNX leave untapped.
- Reduced-precision support: Built-in FP16 and INT8 modes lower latency and memory use with minimal impact on accuracy.
- Designed for real-time workloads: TensorRT targets high-throughput, low-latency inference, making it a natural fit for applications like video analytics, robotics, and autonomous driving.
- NVIDIA ecosystem: Exported engines fit into NVIDIA’s wider deployment stack, from compact Jetson edge devices to servers running Triton Inference Server.
Ultralytics YOLO models exported to the TensorRT format can be deployed across a wide range of real-world scenarios. These optimized models are especially useful where fast, efficient AI performance is key. Let’s explore some interesting examples of how they can be used.
A wide range of tasks in retail stores, such as scanning barcodes, weighing products, or packaging items, are still handled manually by staff. However, relying solely on employees can slow down operations and lead to customer frustration, especially at the checkout. Long lines are inconvenient for both shoppers and store owners. Smart self-checkout counters are a great solution to this problem.
These counters use computer vision and GPUs to speed up the process, helping reduce wait times. Computer vision enables these systems to see and understand their environment through tasks like object detection. Advanced models like YOLO11, when optimized with tools like TensorRT, can run much faster on GPU devices.
These exported models are well-suited for smart retail setups using compact but powerful hardware devices like the NVIDIA Jetson Nano, designed specifically for edge AI applications.
A computer vision model like YOLO11 can be custom-trained to detect defective products in the manufacturing industry. Once trained, the model can be exported to the TensorRT format for deployment in facilities equipped with high-performance AI systems.
As products move along conveyor belts, cameras capture images, and the YOLO11 model, running in TensorRT format, analyzes them in real time to spot defects. This setup allows companies to catch issues quickly and accurately, reducing errors and improving efficiency.
Similarly, industries such as pharmaceuticals are using these types of systems to identify defects in medical packaging. In fact, the global market for smart defect detection systems is set to grow to $5 billion by 2026.
While the TensorRT integration brings many advantages, like faster inference speeds and reduced latency, here are a few limitations to keep in mind:

- Hardware-specific engines: An exported engine file is tied to the GPU architecture and TensorRT version it was built with, so it generally can’t be reused across different devices.
- NVIDIA-only: TensorRT models run exclusively on NVIDIA GPUs, not on CPUs or hardware from other vendors.
- Potential accuracy loss with INT8: Aggressive quantization can slightly reduce accuracy and requires a representative calibration dataset to work well.
Exporting Ultralytics YOLO models to the TensorRT format makes them run significantly faster and more efficiently, which is ideal for real-time tasks like detecting defects in factories, powering smart checkout systems, or monitoring busy urban areas.
This optimization helps the models perform better on NVIDIA GPUs by speeding up predictions and reducing memory and power usage. While there are a few limitations, the performance boost makes TensorRT integration a great choice for anyone building high-speed computer vision systems on NVIDIA hardware.
Want to learn more about AI? Explore our GitHub repository, connect with our community, and check out our licensing options to jumpstart your computer vision project. Find out more about innovations like AI in manufacturing and computer vision in the logistics industry on our solutions pages.