Discover how depth estimation creates depth maps from images—stereo, ToF, LiDAR, and monocular deep learning—to power robotics, AR/VR and 3D perception.
Depth estimation is a fundamental task in computer vision (CV) that involves determining the distance of objects in a scene relative to the camera. By calculating the depth value for each pixel in an image, this process transforms standard two-dimensional data into a rich 3D representation, often referred to as a depth map. This capability is essential for machines to perceive spatial relationships, enabling them to navigate environments, manipulate objects, and understand the geometry of the world much like the human visual system does.
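For instance, a pretrained monocular model can infer such a depth map directly from a single photograph. The minimal sketch below uses the Hugging Face transformers "depth-estimation" pipeline with the Intel/dpt-large checkpoint; the checkpoint and the file names are illustrative choices, not requirements.

from PIL import Image
from transformers import pipeline

# Monocular depth estimation sketch: one RGB image in, a per-pixel depth map out.
# The Intel/dpt-large checkpoint and "scene.jpg" are placeholder choices for illustration.
depth_estimator = pipeline("depth-estimation", model="Intel/dpt-large")
result = depth_estimator(Image.open("scene.jpg"))
depth_map = result["depth"]  # PIL image whose pixel intensities encode relative depth
depth_map.save("scene_depth.png")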
Depth can be estimated through various methods, ranging from hardware-intensive active sensing (such as LiDAR and time-of-flight sensors) to software-driven deep learning (DL) approaches that infer depth from one or a few standard images.
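As an example of the classical, geometry-based end of that spectrum, the sketch below computes depth from a rectified stereo pair with OpenCV: the disparity d between matching pixels is converted to depth via Z = f * B / d, where f is the focal length in pixels and B is the camera baseline. The image paths and calibration values are placeholders.

import cv2
import numpy as np

# Classical stereo depth sketch. StereoBM returns disparities in fixed-point format
# (scaled by 16), so the result is divided by 16 before use. The file names and the
# focal length / baseline values below come from a hypothetical calibration.
left = cv2.imread("left.png", cv2.IMREAD_GRAYSCALE)
right = cv2.imread("right.png", cv2.IMREAD_GRAYSCALE)

stereo = cv2.StereoBM_create(numDisparities=64, blockSize=15)
disparity = stereo.compute(left, right).astype(np.float32) / 16.0

focal_length_px = 700.0  # focal length in pixels (placeholder)
baseline_m = 0.12        # distance between the two cameras in metres (placeholder)

depth_m = np.zeros_like(disparity)
valid = disparity > 0
depth_m[valid] = focal_length_px * baseline_m / disparity[valid]  # Z = f * B / d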
The ability to perceive the third dimension unlocks critical functionality across various industries.
In the field of autonomous vehicles, depth estimation is vital for safety and navigation. Self-driving cars combine camera data with LiDAR to detect obstacles, estimate the distance to other vehicles, and construct a real-time map of the road. Similarly, in robotics, depth perception allows automated arms to perform "pick and place" operations by accurately judging the position and shape of items in manufacturing automation workflows.
For augmented reality experiences to be immersive, virtual objects must interact realistically with the physical world. Depth estimation enables mobile devices to understand the geometry of a room, allowing virtual furniture or characters to be placed on the floor or hidden behind real-world objects (occlusion), vastly improving the user experience.
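To make the occlusion idea concrete, the short sketch below hides virtual pixels wherever the real scene is closer to the camera than the virtual object. The array names and files are hypothetical, and both depth buffers are assumed to be aligned and expressed in the same units.

import numpy as np

# Occlusion sketch: a virtual pixel is drawn only where it is nearer to the camera
# than the measured real-world surface. "scene_depth.npy", "virtual_depth.npy" and
# "virtual_rgba.npy" are hypothetical inputs sharing the same resolution and units.
scene_depth = np.load("scene_depth.npy")      # H x W depth of the real scene
virtual_depth = np.load("virtual_depth.npy")  # H x W depth of the rendered virtual layer
virtual_rgba = np.load("virtual_rgba.npy")    # H x W x 4 rendered virtual layer (float, 0-1)

visible = virtual_depth < scene_depth                  # True where the virtual object is in front
virtual_rgba[..., 3] = virtual_rgba[..., 3] * visible  # zero the alpha of occluded pixels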
While dedicated depth models exist, developers often use 2D object detection alongside camera calibration data to approximate distance. The ultralytics library simplifies this via its solutions module, allowing users to estimate the distance of tracked objects based on their bounding box positions.
The following code demonstrates how to use YOLO11 to track objects and calculate their approximate distance from the camera.
import cv2

from ultralytics import YOLO, solutions

# Load the YOLO11 model for object detection
model = YOLO("yolo11n.pt")

# Initialize the DistanceCalculation solution
# This estimates distance based on bounding box centroids
dist_obj = solutions.DistanceCalculation(names=model.names, view_img=True)

# Open a video file or camera stream
cap = cv2.VideoCapture("path/to/video.mp4")

while cap.isOpened():
    success, im0 = cap.read()
    if not success:
        break

    # Track objects and calculate distance
    tracks = model.track(im0, persist=True, show=False)
    im0 = dist_obj.start_process(im0, tracks)

    # Display result (or save/process further)
    cv2.imshow("Distance Estimation", im0)
    if cv2.waitKey(1) == ord("q"):
        break

cap.release()
cv2.destroyAllWindows()
It is also important to distinguish depth estimation from similar-sounding terms in the AI ecosystem, since related tasks answer different questions about a scene than per-pixel depth prediction does.
Recent advancements in Generative AI and foundation models are further bridging the gap between 2D and 3D. Techniques like Neural Radiance Fields (NeRF) reconstruct complex 3D scenes from sparse 2D images, relying heavily on underlying depth principles. As model optimization improves, highly accurate depth estimation is becoming feasible on edge devices, powering the next generation of smart drones, service robots, and spatial computing devices.