Deploy Ultralytics YOLOv5 With Neural Magic’s DeepSparse for GPU-Class Performance on CPUs
Empower Ultralytics YOLOv5 model training & deployment with Neural Magic's DeepSparse for GPU-class performance on CPUs. Achieve faster, scalable YOLOv5 deployments.

Want to accelerate the training and deployment of your YOLOv5 models? We’ve got you covered! Introducing our newest partner, Neural Magic. As Neural Magic provides software tools that emphasize peak model performance and workflow simplicity, it’s only natural that we’ve come together to offer a solution to make the YOLOv5 deployment process even better.
DeepSparse is Neural Magic’s CPU inference runtime, which takes advantage of sparsity and low-precision arithmetic within neural networks to offer exceptional performance on commodity hardware. For instance, compared to the ONNX Runtime baseline, DeepSparse offers a 5.8x speed-up for YOLOv5s running on the same machine!

For the first time, your deep learning workloads can meet the performance demands of production without the complexity and costs of hardware accelerators. Put simply, DeepSparse gives you the performance of GPUs and the simplicity of software:
- Flexible Deployments: Run consistently across cloud, data center, and edge with any hardware provider
- Infinite Scalability: Scale out with standard Kubernetes, vertically to 100s of cores, or fully abstracted with serverless
- Easy Integration: Use clean APIs for integrating your model into an application and monitoring it in production
Achieve GPU-Class Performance on Commodity CPUs
DeepSparse takes advantage of model sparsity to gain its performance speedup.
Sparsification through pruning and quantization allows order-of-magnitude reductions in the size and compute needed to execute a network while maintaining high accuracy. DeepSparse is sparsity-aware, skipping the multiply-adds by zero and shrinking the amount of compute in a forward pass. Since sparse computation is memory-bound, DeepSparse executes the network depth-wise, breaking the problem into Tensor Columns, which are vertical stripes of computation that fit in cache.

Sparse networks with compressed computation, executed depth-wise in cache, allow DeepSparse to deliver GPU-class performance on CPUs!
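To make the idea of skipping multiply-adds by zero concrete, here is a minimal plain-Python sketch. It is an illustration only, not DeepSparse's kernels: the function names, the per-row magnitude pruning rule, and the toy weights are all assumptions chosen for readability.

```python
# Illustrative only: a toy sparse matrix-vector product, not DeepSparse's kernels.

def magnitude_prune(weights, sparsity):
    """Zero out the smallest-magnitude fraction of weights in each row."""
    pruned = []
    for row in weights:
        n_prune = int(len(row) * sparsity)
        # Indices of the n_prune smallest-magnitude entries in this row
        smallest = sorted(range(len(row)), key=lambda i: abs(row[i]))[:n_prune]
        drop = set(smallest)
        pruned.append([0.0 if i in drop else w for i, w in enumerate(row)])
    return pruned


def to_sparse_rows(weights):
    """Store only (column, value) pairs for the non-zero weights."""
    return [[(j, w) for j, w in enumerate(row) if w != 0.0] for row in weights]


def sparse_matvec(sparse_rows, x):
    """Multiply using only stored non-zeros and count the multiply-adds performed."""
    out, macs = [], 0
    for row in sparse_rows:
        acc = 0.0
        for j, w in row:
            acc += w * x[j]
            macs += 1
        out.append(acc)
    return out, macs


weights = [
    [0.9, -0.05, 0.02, 0.7],
    [0.01, 0.8, -0.6, 0.03],
]
x = [1.0, 2.0, 3.0, 4.0]

pruned = magnitude_prune(weights, sparsity=0.5)
y, macs = sparse_matvec(to_sparse_rows(pruned), x)
dense_macs = len(weights) * len(x)
print(f"multiply-adds: {macs} sparse vs {dense_macs} dense")
```

At 50% sparsity the toy layer performs half the multiply-adds of its dense counterpart. A production runtime also has to deal with memory layout and cache behavior, which this sketch ignores.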
Create A Sparse Version of YOLOv5 Trained on Custom Data
Neural Magic's open-source model repository, SparseZoo, contains pre-sparsified checkpoints of each YOLOv5 model. Using SparseML, which is integrated with Ultralytics, you can fine-tune a sparse checkpoint onto your data with a single CLI command.
Deploy YOLOv5 With DeepSparse
Install DeepSparse
Run the following to install DeepSparse. We recommend you use a virtual environment with Python.
pip install deepsparse[server,yolo,onnxruntime]

Collect an ONNX File
DeepSparse accepts a model in the ONNX format, passed either as:
- A local path to an ONNX model
- A SparseZoo stub that identifies a model in the SparseZoo
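The two accepted forms are easy to tell apart in application code. The helper below is a hypothetical sketch, not part of DeepSparse: it assumes that stubs start with the `zoo:` prefix (as the stubs in this article do) and that local models use the `.onnx` file extension.

```python
from pathlib import Path

ZOO_PREFIX = "zoo:"  # assumption: prefix shared by the SparseZoo stubs in this article


def classify_model_path(model_path: str) -> str:
    """Hypothetical helper: label a model reference as a stub or a local ONNX file."""
    if model_path.startswith(ZOO_PREFIX):
        return "sparsezoo_stub"
    if Path(model_path).suffix.lower() == ".onnx":
        return "local_onnx"
    raise ValueError(f"Not a SparseZoo stub or .onnx path: {model_path!r}")


print(classify_model_path("zoo:cv/detection/yolov5-s/pytorch/ultralytics/coco/base-none"))
print(classify_model_path("models/yolov5s.onnx"))  # hypothetical local path
```

A check like this can catch a mistyped model reference before any download or engine compilation starts.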
We will compare the standard dense YOLOv5s to the pruned-quantized YOLOv5s, identified by the following SparseZoo stubs:
zoo:cv/detection/yolov5-s/pytorch/ultralytics/coco/base-none
zoo:cv/detection/yolov5-s/pytorch/ultralytics/coco/pruned65_quant-none

Deploy a Model
DeepSparse offers convenient APIs for integrating your model into an application.
To try the deployment examples below, pull down a sample image for the example and save as basilica.jpg with the following command:
wget -O basilica.jpg https://raw.githubusercontent.com/neuralmagic/deepsparse/main/src/deepsparse/yolo/sample_images/basilica.jpg

Python API
Pipelines wrap pre-processing and output post-processing around the runtime, providing a clean interface for adding DeepSparse to an application. The DeepSparse-Ultralytics integration includes an out-of-the-box Pipeline that accepts raw images and outputs the bounding boxes.
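For an object detector such as YOLOv5, output post-processing typically means discarding low-confidence candidates and applying non-maximum suppression (NMS) to overlapping boxes. The sketch below illustrates that idea in plain Python. It is not the Pipeline's actual implementation: the function names, the `[x1, y1, x2, y2]` box layout, the default thresholds, and the class-agnostic suppression are all assumptions.

```python
def iou(a, b):
    """Intersection-over-union of two [x1, y1, x2, y2] boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, conf_thres=0.25, iou_thres=0.6):
    """Return indices of boxes kept after confidence filtering and greedy NMS."""
    # Keep candidates above the confidence threshold, highest score first
    order = sorted(
        (i for i, s in enumerate(scores) if s >= conf_thres),
        key=lambda i: scores[i],
        reverse=True,
    )
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much with any kept box
        if all(iou(boxes[i], boxes[k]) <= iou_thres for k in keep):
            keep.append(i)
    return keep


boxes = [
    [10, 10, 110, 110],    # strong detection
    [12, 12, 112, 112],    # near-duplicate of the first box
    [200, 200, 260, 260],  # separate object
    [0, 0, 5, 5],          # low-confidence noise
]
scores = [0.90, 0.80, 0.75, 0.10]
kept = nms(boxes, scores)
print(kept)  # [0, 2]
```

The near-duplicate box is suppressed by the IoU threshold and the noisy box by the confidence threshold, leaving one box per object.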
Create a Pipeline and run inference:
from deepsparse import Pipeline

# list of images in local filesystem
images = ["basilica.jpg"]

# create Pipeline
model_stub = "zoo:cv/detection/yolov5-s/pytorch/ultralytics/coco/pruned65_quant-none"
yolo_pipeline = Pipeline.create(
    task="yolo",
    model_path=model_stub,
)

# run inference on images, receive bounding boxes + classes
pipeline_outputs = yolo_pipeline(images=images, iou_thres=0.6, conf_thres=0.001)
print(pipeline_outputs)

If you are running in the cloud, you may get an error that open-cv cannot find libGL.so.1. Running the following on Ubuntu installs it:
apt-get install libgl1-mesa-glx

HTTP Server
DeepSparse Server runs on top of the popular FastAPI web framework and Uvicorn web server. With just a single CLI command, you can easily set up a model service endpoint with DeepSparse. The Server supports any Pipeline from DeepSparse, including object detection with YOLOv5, enabling you to send raw images to the endpoint and receive the bounding boxes.
Spin up the Server with the pruned-quantized YOLOv5s:
deepsparse.server \
    --task yolo \
    --model_path zoo:cv/detection/yolov5-s/pytorch/ultralytics/coco/pruned65_quant-none

An example request, using Python's requests package:
import requests, json
# list of images for inference (local files on client side)
path = ['basilica.jpg']
files = [('request', open(img, 'rb')) for img in path]
# send request over HTTP to /predict/from_files endpoint
url = 'http://0.0.0.0:5543/predict/from_files'
resp = requests.post(url=url, files=files)
# response is returned in JSON
annotations = json.loads(resp.text) # dictionary of annotation results
bounding_boxes = annotations["boxes"]
labels = annotations["labels"]

Annotate CLI
You can also use the annotate command to have the engine save an annotated photo on disk. Try --source 0 to annotate your live webcam feed!
deepsparse.object_detection.annotate --model_filepath zoo:cv/detection/yolov5-s/pytorch/ultralytics/coco/pruned65_quant-none --source basilica.jpg

Running the above command will create an annotation-results folder and save the annotated image inside.

Benchmark Performance
Using DeepSparse's benchmarking script, we will compare DeepSparse's throughput to ONNX Runtime's throughput on YOLOv5s.
The benchmarks were run on an AWS c6i.8xlarge instance (16 cores).
Batch 32 Performance Comparison
ONNX Runtime Baseline
At batch 32, ONNX Runtime achieves 42 images/sec with the standard dense YOLOv5s:
deepsparse.benchmark zoo:cv/detection/yolov5-s/pytorch/ultralytics/coco/base-none -s sync -b 32 -nstreams 1 -e onnxruntime

Original Model Path: zoo:cv/detection/yolov5-s/pytorch/ultralytics/coco/base-none
Batch Size: 32
Scenario: sync
Throughput (items/sec): 41.9025

DeepSparse Dense Performance
While DeepSparse offers its best performance with optimized sparse models, it also performs well with the standard dense YOLOv5s.
At batch 32, DeepSparse achieves 70 images/sec with the standard dense YOLOv5s, a 1.7x performance improvement over ONNX Runtime!
deepsparse.benchmark zoo:cv/detection/yolov5-s/pytorch/ultralytics/coco/base-none -s sync -b 32 -nstreams 1

Original Model Path: zoo:cv/detection/yolov5-s/pytorch/ultralytics/coco/base-none
Batch Size: 32
Scenario: sync
Throughput (items/sec): 69.5546

DeepSparse Sparse Performance
When sparsity is applied to the model, DeepSparse's performance gains over ONNX Runtime are even stronger.
At batch 32, DeepSparse achieves 241 images/sec with the pruned-quantized YOLOv5s, a 5.8x performance improvement over ONNX Runtime!
deepsparse.benchmark zoo:cv/detection/yolov5-s/pytorch/ultralytics/coco/pruned65_quant-none -s sync -b 32 -nstreams 1

Original Model Path: zoo:cv/detection/yolov5-s/pytorch/ultralytics/coco/pruned65_quant-none
Batch Size: 32
Scenario: sync
Throughput (items/sec): 241.2452

Batch 1 Performance Comparison
DeepSparse is also able to gain a speed-up over ONNX Runtime for the latency-sensitive, batch 1 scenario.
ONNX Runtime Baseline
At batch 1, ONNX Runtime achieves 48 images/sec with the standard, dense YOLOv5s.
deepsparse.benchmark zoo:cv/detection/yolov5-s/pytorch/ultralytics/coco/base-none -s sync -b 1 -nstreams 1 -e onnxruntime

Original Model Path: zoo:cv/detection/yolov5-s/pytorch/ultralytics/coco/base-none
Batch Size: 1
Scenario: sync
Throughput (items/sec): 48.0921

DeepSparse Sparse Performance
Applying sparsity to the model again gives DeepSparse a clear advantage over ONNX Runtime.
At batch 1, DeepSparse achieves 135 images/sec with the pruned-quantized YOLOv5s, a 2.8x performance improvement over ONNX Runtime!
deepsparse.benchmark zoo:cv/detection/yolov5-s/pytorch/ultralytics/coco/pruned65_quant-none -s sync -b 1 -nstreams 1

Original Model Path: zoo:cv/detection/yolov5-s/pytorch/ultralytics/coco/pruned65_quant-none
Batch Size: 1
Scenario: sync
Throughput (items/sec): 134.9468
Since c6i.8xlarge instances have VNNI instructions, DeepSparse's throughput can be pushed further if weights are pruned in blocks of 4.
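To picture what pruning in blocks of 4 means: VNNI instructions compute dot products over groups of four 8-bit values, so zeros help most when they cover a whole group. The sketch below zeroes entire blocks of four consecutive weights, weakest blocks first. It is an illustration under assumptions, not the algorithm used to produce the SparseZoo checkpoint; the function name, the ranking rule, and the toy integer weights are hypothetical.

```python
def prune_in_blocks(row, block_size=4, sparsity=0.5):
    """Zero whole blocks of consecutive weights, lowest total magnitude first."""
    if len(row) % block_size != 0:
        raise ValueError("row length must be a multiple of block_size")
    n_blocks = len(row) // block_size
    blocks = [row[b * block_size:(b + 1) * block_size] for b in range(n_blocks)]
    n_prune = int(n_blocks * sparsity)
    # Rank blocks by the sum of absolute values and prune the weakest ones
    weakest = sorted(range(n_blocks), key=lambda b: sum(abs(w) for w in blocks[b]))[:n_prune]
    drop = set(weakest)
    pruned = []
    for b, block in enumerate(blocks):
        pruned.extend([0] * block_size if b in drop else block)
    return pruned


# Toy int8-style weights, grouped visually into blocks of four
row = [3, -1, 2, 4,   0, 1, -1, 0,   5, 5, -2, 1,   1, 0, 0, -1]
pruned = prune_in_blocks(row)
print(pruned)
```

Every surviving group of four is fully populated and every pruned group is entirely zero, so a runtime can skip whole 4-element dot products instead of individual weights.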
At batch 1, DeepSparse achieves 180 images/sec with a 4-block pruned-quantized YOLOv5s, a 3.7x performance gain over ONNX Runtime!
deepsparse.benchmark zoo:cv/detection/yolov5-s/pytorch/ultralytics/coco/pruned35_quant-none-vnni -s sync -b 1 -nstreams 1

Original Model Path: zoo:cv/detection/yolov5-s/pytorch/ultralytics/coco/pruned35_quant-none-vnni
Batch Size: 1
Scenario: sync
Throughput (items/sec): 179.7375
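The speed-up figures quoted in this section are simply each DeepSparse throughput divided by the ONNX Runtime throughput at the same batch size. A short sketch that reproduces them from the benchmark outputs above (the variable names and run labels are ours):

```python
# Throughput (items/sec) copied from the benchmark outputs in this section
baselines = {32: 41.9025, 1: 48.0921}  # ONNX Runtime, dense YOLOv5s
deepsparse_runs = [
    (32, "dense", 69.5546),
    (32, "pruned-quantized", 241.2452),
    (1, "pruned-quantized", 134.9468),
    (1, "pruned-quantized, VNNI", 179.7375),
]


def speedup(throughput, baseline):
    """Ratio of a run's throughput to the baseline throughput."""
    return throughput / baseline


speedups = {
    (batch, label): round(speedup(tput, baselines[batch]), 1)
    for batch, label, tput in deepsparse_runs
}
for (batch, label), s in speedups.items():
    print(f"batch {batch:>2}, {label}: {s}x over ONNX Runtime")
```

Rounded to one decimal place, the ratios come out to 1.7x, 5.8x, 2.8x, and 3.7x, matching the figures in the text.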
And voila! You’re ready to optimize your YOLOv5 deployment with DeepSparse.
Get Started With YOLOv5 and DeepSparse
To get in touch with us, join our community and leave us your questions and comments. Check out the Ultralytics YOLOv5 repository and full Neural Magic documentation for deploying YOLOv5.
At Ultralytics, we commercially partner with other startups to help us fund the research and development of our awesome open-source tools, like YOLOv5, to keep them free for everybody. This article may contain affiliate links to those partners.






