
Ultralytics Platform

Deploy vision AI models across 43 global regions

Take your trained models from browser testing to production endpoints in just a few clicks, with auto-scaling, real-time monitoring, and 17+ export formats. The end-to-end solution for deploying real-world use cases.

User interface displaying export options for PyTorch models including ONNX, TorchScript, OpenVINO, TensorRT, CoreML, TF Lite, TF SavedModel, and TF GraphDef, with a world map showing 3 deployments in green and multiple red location pins.

43+ deployment regions
17+ export formats
500+ active deployments

Deploy to 43 regions worldwide

Deploy your models to dedicated endpoints across 43 global regions, spanning the Americas, Europe, Asia-Pacific, and the Middle East. Each endpoint is a single-tenant service with its own URL, auto-scaling, and independent monitoring.
World map showing various locations marked with colored pins in North America, Europe, and Asia.
Dashboard showing model performance metrics with mAP50 at 96.2%, mAP50-95 at 90.1%, and Precision at 87.2%, alongside a logs panel for the YOLO26s segmentation model deployed in Paris.

Auto-scaling that matches your traffic

Dedicated endpoints scale up automatically to handle traffic spikes and scale down to zero when idle, so you're never paying for compute you're not using.

Scale to zero by default. No cost when your endpoint isn't receiving requests.

No rate limits. Unlike shared inference, dedicated endpoints have no throughput caps; they're limited only by your endpoint's resources.

Configurable resources. Choose CPU cores (1–8) and memory (1–32 GB) to match your model's requirements and traffic patterns, as sketched below.
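
As a rough illustration of those knobs (the field names below are hypothetical placeholders, not the platform's actual configuration schema):

```python
# Hypothetical deployment configuration; field names are illustrative only.
deployment_config = {
    "region": "europe-west9",  # e.g. a Paris endpoint
    "cpu_cores": 2,            # configurable from 1 to 8
    "memory_gb": 4,            # configurable from 1 to 32
    "min_instances": 0,        # scale to zero when idle
}
```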

17+ export formats. Your model. Any environment.

Ultralytics Platform supports both cloud and edge deployment. All Ultralytics YOLO models are natively optimized to run efficiently across environments, delivering reliable performance even on hardware with limited compute resources.
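
For reference, the same export formats are available in the open-source ultralytics Python package; a minimal sketch (the model filename here is just an example):

```python
from ultralytics import YOLO

# Load trained weights (the filename is an example placeholder)
model = YOLO("yolo11n.pt")

# Export to ONNX; other format strings include "torchscript", "openvino",
# "engine" (TensorRT), "coreml", "saved_model", and "tflite"
model.export(format="onnx")
```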

List of export formats for PyTorch models including ONNX, TorchScript, OpenVINO, TensorRT GPU, CoreML, and TF Lite with their respective icons and format codes.
Dashboard showing 13,959 total requests, 3 active deployments, 0% error rate, and 14 ms P95 latency in the last 24 hours.

Monitor everything in production

Full real-time visibility into how your models perform. Once your models are live, the deployments dashboard gives you a centralized overview of every running endpoint, with the metrics you need to keep your models running reliably.

Request volume. Total requests across all endpoints over the last 24 hours.

P95 latency. 95th percentile response time to track real-world performance (see the sketch after this list).

Error rates. Highlighted alerts when error rates exceed 5%, with severity-filtered logs to help you diagnose issues fast.

Health checks. Live status indicators with automatic retry when endpoints are unhealthy. Response latency is displayed alongside each check.
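
To make the P95 metric concrete, here is a minimal sketch of how a 95th-percentile figure is computed from raw response times (the sample data is made up):

```python
import numpy as np

# Made-up sample of recent response times, in milliseconds
latencies_ms = [9, 11, 12, 10, 14, 13, 95, 12, 11, 10]

# P95: 95% of requests completed at or below this latency,
# so a single slow outlier pulls the figure up
p95 = np.percentile(latencies_ms, 95)
print(f"P95 latency: {p95:.0f} ms")
```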

Integrate in minutes

Every deployed endpoint comes with auto-generated code examples in Python, JavaScript, and cURL, pre-populated with your actual endpoint URL and API key. Copy, paste, and start sending inference requests from any application.

Python code snippet for sending an image to a deployment endpoint using requests with authorization and inference parameters.
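
As a sketch of what such a request can look like in Python (the URL, auth scheme, and parameter names here are illustrative placeholders, not the platform's actual contract):

```python
import requests

# Illustrative placeholders; the platform generates snippets pre-filled
# with your real endpoint URL and API key.
ENDPOINT_URL = "https://your-endpoint.example.com/predict"
API_KEY = "your-api-key"

with open("image.jpg", "rb") as f:
    response = requests.post(
        ENDPOINT_URL,
        headers={"Authorization": f"Bearer {API_KEY}"},
        files={"file": f},
        data={"conf": 0.25},  # example inference parameter (confidence threshold)
    )

print(response.json())  # predictions returned as JSON
```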

Need to train a model first?

Ultralytics Platform connects annotation, training, and deployment in a single platform.

Frequently asked questions

Can I deploy the same model to multiple regions?

Yes. Each model can be deployed to multiple regions simultaneously. Your plan determines the total number of endpoints available: 3 for Free, 10 for Pro, and unlimited for Enterprise. This allows you to serve users globally with low-latency endpoints in each region.

How much does deployment cost?

Dedicated endpoints are billed based on CPU, memory, and request volume. With scale-to-zero enabled by default, you only pay for active inference time; there's no cost when your endpoint isn't receiving requests. Shared inference is included with your platform plan.

What's the difference between shared and dedicated inference?

Shared inference runs on a multi-tenant service across 3 regions and is rate-limited to 20 requests per minute; it's best for development and quick testing. Dedicated endpoints are single-tenant services deployed to any of 43 regions with no rate limits, consistent latency, and configurable resources; they're built for scalable production workloads.

How long does deployment take?

Dedicated endpoint deployment typically takes one to two minutes. This includes container provisioning, startup, and an initial health check to validate the service is ready. Once the endpoint is ready, it begins accepting inference requests immediately.

What is model deployment?

Model deployment is the process of making a trained computer vision model available to receive and process real-world data. Once deployed, computer vision applications can send images and video frames to the model via API and receive predictions, enabling everything from automated quality inspection to real-time object detection in production systems. On Ultralytics Platform, deployment is integrated directly into the end-to-end training workflow. Once your model is trained, you can test it in the browser, deploy it to a dedicated endpoint in any of 43 global regions, and monitor its performance, all from the same workspace.

Start deploying today

Take your trained models to production across 43 global regions with auto-scaling and real-time monitoring.