Learn how depth estimation adds 3D perspective to computer vision. Explore techniques like monocular depth and stereo vision using Ultralytics YOLO26 models.
Depth estimation is a critical process in computer vision that determines the distance of objects from a camera, effectively adding a third dimension to 2D images. By calculating how far away every pixel in an image is, this technique creates a depth map, a representation where pixel intensity corresponds to distance. This capability mimics human binocular vision, allowing machines to perceive spatial relationships and geometry. It is a cornerstone technology for enabling autonomous systems to navigate safely, understand their environment, and interact with physical objects.
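To make the idea of a depth map concrete, the short sketch below normalizes a depth array into an 8-bit grayscale image, so nearer pixels appear darker and farther pixels brighter. The depth array here is purely illustrative, standing in for the output of any depth estimator.

import numpy as np

# Hypothetical depth map: one distance value (in meters) per pixel
depth = np.random.uniform(0.5, 10.0, size=(480, 640)).astype(np.float32)

# Normalize distances to the 0-255 range so the map can be viewed
# as a grayscale image (near = dark, far = bright)
depth_vis = (255 * (depth - depth.min()) / (depth.max() - depth.min())).astype(np.uint8)

print(depth_vis.shape, depth_vis.dtype)  # (480, 640) uint8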
There are several ways to achieve depth estimation, ranging from hardware-based solutions to purely software-driven approaches using artificial intelligence.
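For example, classic stereo vision recovers depth by triangulation: given a calibrated camera pair, the depth Z of a point follows from the focal length f (in pixels), the baseline B between the two cameras, and the disparity d between matched pixels as Z = f * B / d. The sketch below applies this formula with purely illustrative numbers.

def stereo_depth(focal_px: float, baseline_m: float, disparity_px: float) -> float:
    """Depth from stereo triangulation: Z = f * B / d."""
    return focal_px * baseline_m / disparity_px

# Illustrative values: 700 px focal length, 12 cm baseline, 35 px disparity
print(f"Depth: {stereo_depth(700.0, 0.12, 35.0):.2f} m")  # Depth: 2.40 m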
The ability to gauge distance is transformative across many industries, powering applications that require spatial awareness.
While specialized depth models exist, you can often infer spatial relationships using object detection bounding boxes as a proxy for distance in simple scenarios (larger boxes often mean closer objects). Here is how to load a model using the ultralytics package to detect objects, which is the first step in many depth-aware pipelines.
from ultralytics import YOLO

# Load the YOLO26 model
model = YOLO("yolo26n.pt")

# Run inference on an image
results = model("path/to/image.jpg")

# Process results
for result in results:
    # Get bounding boxes (xyxy format)
    boxes = result.boxes.xyxy

    # Iterate through detections
    for box in boxes:
        print(f"Detected object at: {box}")
It is important to distinguish depth estimation from related terms. While object detection identifies what and where an object is in 2D space (using a bounding box), depth estimation identifies how far away it is (Z-axis). Similarly, semantic segmentation classifies pixels into categories (e.g., road, sky, car), whereas depth estimation assigns a distance value to those same pixels.
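To see what a per-pixel distance prediction looks like in practice, here is a sketch of monocular depth estimation using the open-source MiDaS model loaded via torch.hub. It assumes torch, timm, and opencv-python are installed, and note that MiDaS predicts relative (inverse) depth rather than metric distance.

import cv2
import torch

# Load a small MiDaS monocular depth model and its matching transform
midas = torch.hub.load("intel-isl/MiDaS", "MiDaS_small")
midas.eval()
transforms = torch.hub.load("intel-isl/MiDaS", "transforms")
transform = transforms.small_transform

# Read an image and convert BGR -> RGB
img = cv2.cvtColor(cv2.imread("path/to/image.jpg"), cv2.COLOR_BGR2RGB)

with torch.no_grad():
    prediction = midas(transform(img))
    # Resize the prediction back to the original image resolution
    depth = torch.nn.functional.interpolate(
        prediction.unsqueeze(1),
        size=img.shape[:2],
        mode="bicubic",
        align_corners=False,
    ).squeeze()

print(depth.shape)  # one relative depth value per pixel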
Recent progress in generative AI is bridging the gap between 2D and 3D vision. Techniques like Neural Radiance Fields (NeRF) use multiple 2D images to reconstruct complex 3D scenes, relying heavily on underlying depth principles. Furthermore, as model optimization techniques improve, running highly accurate depth estimation on edge AI devices is becoming feasible. This enables real-time spatial computing on hardware as small as drones or smart glasses, facilitated by platforms like the Ultralytics Platform for efficient model training and deployment.