Ultralytics Benchmarks
How YOLO26 performs on the Ultralytics Platform's NVIDIA GPUs: measured training throughput, auto-batch size, peak memory, power draw, and cost-efficiency, so you can pick the right GPU for your budget and timeline.
GPU Training Throughput
YOLO26 detection trained on COCO at 640px with auto-batch across every NVIDIA GPU on the Platform.
GPU | VRAM | Throughput (img/s) | Batch size | Peak VRAM | Peak power | Price/hr | Images per dollar
H200 SXM | 141 GB | 490.6 | 221 | 40.2 GB | 361 W | $4.39 | 402.3K
B300 | 288 GB | 474.6 | 411 | 73.4 GB | 457 W | $7.39 | 231.2K
H200 NVL | 143 GB | 469.3 | 221 | 39.8 GB | 283 W | $3.39 | 498.4K
H100 NVL | 94 GB | 432.1 | 147 | 26.8 GB | 274 W | $3.19 | 487.6K
H100 SXM | 80 GB | 424.3 | 123 | 23.1 GB | 315 W | $3.29 | 464.3K
RTX PRO 6000 | 96 GB | 420.6 | 149 | 27.6 GB | 319 W | $2.09 | 724.5K
B200 | 180 GB | 404.9 | 281 | 50.4 GB | 419 W | $5.89 | 247.5K
RTX 5090 | 32 GB | 356 | 49 | 9.7 GB | 304 W | $0.99 | 1.3M
RTX 4090 | 24 GB | 306 | 35 | 13.5 GB | 235 W | $0.69 | 1.6M
H100 PCIe | 80 GB | 302.4 | 123 | 22.6 GB | 197 W | $2.89 | 376.7K
RTX 6000 Ada | 48 GB | 294.8 | 73 | 13.9 GB | 254 W | $0.77 | 1.4M
A100 SXM | 80 GB | 286.8 | 123 | 23.1 GB | 342 W | $1.49 | 692.9K
A100 PCIe | 80 GB | 283.2 | 123 | 23.2 GB | 328 W | $1.39 | 733.5K
L40S | 48 GB | 265.1 | 70 | 13.3 GB | 258 W | $0.86 | 1.1M
L40 | 48 GB | 255.5 | 68 | 13.1 GB | 255 W | $0.99 | 929.1K
RTX PRO 4500 | 32 GB | 249.7 | 49 | 11.0 GB | 161 W | $0.64 | 1.4M
RTX A6000 | 48 GB | 209.9 | 73 | 13.9 GB | 278 W | $0.49 | 1.5M
RTX 3090 | 24 GB | 184.8 | 35 | 13.3 GB | 312 W | $0.46 | 1.4M
RTX A5000 | 24 GB | 171.2 | 35 | 7.1 GB | 216 W | $0.27 | 2.3M
A40 | 48 GB | 161.2 | 70 | 13.4 GB | 262 W | $0.44 | 1.3M
RTX A4500 | 20 GB | 149.2 | 30 | 9.1 GB | 191 W | $0.25 | 2.1M
RTX 4000 Ada | 20 GB | 127.9 | 30 | 6.5 GB | 98 W | $0.26 | 1.8M
L4 | 24 GB | 116 | 33 | 10.2 GB | 78 W | $0.39 | 1.1M
RTX 2000 Ada | 16 GB | 88 | 22 | 4.7 GB | 60 W | $0.24 | 1.3M
Training methodology
We use training throughput, the number of images processed per second during training, as the yardstick because it translates directly into time-to-solution: for a fixed dataset and epoch count, higher throughput means proportionally shorter training. Every result is measured on Ultralytics Platform GPUs, the same NVIDIA hardware you rent for cloud training in one click, from entry-level workstation cards up to flagship data-center GPUs. We train YOLO26 at all five sizes (n/s/m/l/x) so you can match a model to your hardware budget.
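Because throughput is measured in images per second, a rough training-time estimate is simple arithmetic: total images processed divided by throughput. The sketch below is illustrative only. It assumes the full COCO train2017 split (118,287 images) and a hypothetical 100-epoch run; the function name is made up for this example, and the estimate ignores validation passes, warmup, and the fact that throughput differs between model sizes.

```python
COCO_TRAIN_IMAGES = 118_287  # size of the COCO train2017 split


def estimate_training_hours(images_per_epoch, epochs, throughput_img_s):
    """Rough wall-clock hours: total images processed / images per second."""
    if throughput_img_s <= 0:
        raise ValueError("throughput must be positive")
    total_images = images_per_epoch * epochs
    return total_images / throughput_img_s / 3600  # seconds -> hours


# Throughput values taken from the table above (images per second).
# The 100-epoch run length is an assumption made for illustration.
for gpu, throughput in [("H200 SXM", 490.6), ("RTX 4090", 306.0), ("L4", 116.0)]:
    hours = estimate_training_hours(COCO_TRAIN_IMAGES, 100, throughput)
    print(f"{gpu}: about {hours:.1f} h for 100 epochs")
```

Treat the result as a lower bound on wall-clock time, since real runs also spend time on validation and checkpointing.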
Settings. 2 epochs on 25% of COCO at 640px, AMP mixed precision, single GPU, with auto-batch (batch=-1) selecting the largest batch that fits in memory. We report the steady-state second epoch, which excludes first-epoch warmup (dataset caching, CUDA graph capture) and the end-of-run validation pass. Resolved batch size, peak VRAM (including the CUDA context), and peak board power are recorded directly from each GPU.
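The settings above map fairly directly onto arguments of the Ultralytics Python `train()` API. The following is a minimal sketch, not the Platform's actual benchmark harness: the weights filename `yolo26n.pt`, the helper name `run_benchmark`, and the choice of `device=0` are assumptions, and the per-epoch timing, VRAM, and power measurements described above are not reproduced here.

```python
# Benchmark settings expressed as Ultralytics train() arguments.
BENCHMARK_SETTINGS = {
    "data": "coco.yaml",  # COCO detection dataset config
    "epochs": 2,          # the second epoch is the one reported
    "fraction": 0.25,     # train on 25% of the dataset
    "imgsz": 640,         # 640px input size
    "batch": -1,          # auto-batch picks the batch size
    "amp": True,          # AMP mixed precision
    "device": 0,          # single GPU (index 0 is an assumption)
}


def run_benchmark(weights="yolo26n.pt"):
    """Start a short training run with the settings above.

    The default weights name is an assumed filename for the nano model.
    """
    from ultralytics import YOLO  # imported lazily so the settings can be inspected without it

    model = YOLO(weights)
    return model.train(**BENCHMARK_SETTINGS)
```

Calling `run_benchmark()` on a machine with a CUDA GPU and the `ultralytics` package installed starts the run; swapping in the s, m, l, or x weights would cover the other model sizes.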
Cost-efficiency. Images per dollar = throughput × 3600 ÷ hourly price, where throughput is in images per second and the price is the Ultralytics Platform on-demand rate in dollars per hour. Ranking by this metric often reorders the table dramatically, because lower-priced cards process more images per dollar than flagship GPUs.
Software. Measured on ultralytics 8.4.68, torch 2.8, CUDA 12.8. See also the Train and Benchmark mode docs.
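As a worked example of the images-per-dollar formula, the sketch below recomputes the metric for three rows of the table and re-ranks them by value. The function names are illustrative, and the K/M rounding in `compact` is an assumption chosen to match how the table displays the numbers.

```python
def images_per_dollar(throughput_img_s, price_per_hour):
    """Images trained per dollar: throughput x 3600 / hourly price."""
    if price_per_hour <= 0:
        raise ValueError("price must be positive")
    return throughput_img_s * 3600 / price_per_hour


def compact(n):
    """Format a count the way the table appears to (assumed rounding)."""
    if n >= 1_000_000:
        return f"{n / 1_000_000:.1f}M"
    if n >= 1_000:
        return f"{n / 1_000:.1f}K"
    return f"{n:.0f}"


# (throughput in img/s, on-demand price in $/hr), copied from the table.
gpus = {
    "H200 SXM": (490.6, 4.39),
    "RTX PRO 6000": (420.6, 2.09),
    "RTX A5000": (171.2, 0.27),
}

# Rank by value instead of raw speed: the fastest card ends up last.
ranked = sorted(gpus, key=lambda name: images_per_dollar(*gpus[name]), reverse=True)
for name in ranked:
    print(f"{name}: {compact(images_per_dollar(*gpus[name]))} images per dollar")
```

In this subset, the H200 SXM has the highest throughput but the lowest images per dollar, while the RTX A5000 is the slowest card and the best value.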
Train on the Best GPUs for Less
24 NVIDIA GPUs starting at $0.24/hr — from Ampere to Blackwell. No markup, no minimums, no commitment.
Inference Benchmarks
Predict latency and throughput on CPU and GPU across export formats — PyTorch, ONNX, TensorRT, OpenVINO, CoreML, TF.js, and more — so you can pick the fastest path to deployment.