Discover how depth estimation creates depth maps from images—stereo, ToF, LiDAR, and monocular deep learning—to power robotics, AR/VR and 3D perception.
Depth estimation is a fundamental task in computer vision (CV) that involves determining the distance of objects in a scene relative to the camera. By calculating the depth value for each pixel in an image, this process transforms standard two-dimensional data into a rich 3D representation, often referred to as a depth map. This capability is essential for machines to perceive spatial relationships, enabling them to navigate environments, manipulate objects, and understand the geometry of the world much like the human visual system does.
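For instance, a pretrained monocular model can infer such a depth map directly from a single photograph. The minimal sketch below uses the Hugging Face transformers "depth-estimation" pipeline with the Intel/dpt-large checkpoint; the checkpoint and the file names are illustrative choices, not requirements.

from PIL import Image
from transformers import pipeline

# Monocular depth estimation sketch: one RGB image in, a per-pixel depth map out.
# The Intel/dpt-large checkpoint and "scene.jpg" are placeholder choices for illustration.
depth_estimator = pipeline("depth-estimation", model="Intel/dpt-large")
result = depth_estimator(Image.open("scene.jpg"))
depth_map = result["depth"]  # PIL image whose pixel intensities encode relative depth
depth_map.save("scene_depth.png")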
Depth can be estimated through various methods, ranging from hardware-intensive active sensing (such as LiDAR and time-of-flight sensors) to software-driven deep learning (DL) approaches that infer depth from one or a few standard images.
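As an example of the classical, geometry-based end of that spectrum, the sketch below computes depth from a rectified stereo pair with OpenCV: the disparity d between matching pixels is converted to depth via Z = f * B / d, where f is the focal length in pixels and B is the camera baseline. The image paths and calibration values are placeholders.

import cv2
import numpy as np

# Classical stereo depth sketch. StereoBM returns disparities in fixed-point format
# (scaled by 16), so the result is divided by 16 before use. The file names and the
# focal length / baseline values below come from a hypothetical calibration.
left = cv2.imread("left.png", cv2.IMREAD_GRAYSCALE)
right = cv2.imread("right.png", cv2.IMREAD_GRAYSCALE)

stereo = cv2.StereoBM_create(numDisparities=64, blockSize=15)
disparity = stereo.compute(left, right).astype(np.float32) / 16.0

focal_length_px = 700.0  # focal length in pixels (placeholder)
baseline_m = 0.12        # distance between the two cameras in metres (placeholder)

depth_m = np.zeros_like(disparity)
valid = disparity > 0
depth_m[valid] = focal_length_px * baseline_m / disparity[valid]  # Z = f * B / d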
The ability to perceive the third dimension unlocks critical functionality across various industries.
In the field of autonomous vehicles, depth estimation is vital for safety and navigation. Self-driving cars combine camera data with LiDAR to detect obstacles, estimate the distance to other vehicles, and construct a real-time map of the road. Similarly, in robotics, depth perception allows automated arms to perform "pick and place" operations by accurately judging the position and shape of items in manufacturing automation workflows.
For augmented reality experiences to be immersive, virtual objects must interact realistically with the physical world. Depth estimation enables mobile devices to understand the geometry of a room, allowing virtual furniture or characters to be placed on the floor or hidden behind real-world objects (occlusion), vastly improving the user experience.
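To make the occlusion idea concrete, the short sketch below hides virtual pixels wherever the real scene is closer to the camera than the virtual object. The array names and files are hypothetical, and both depth buffers are assumed to be aligned and expressed in the same units.

import numpy as np

# Occlusion sketch: a virtual pixel is drawn only where it is nearer to the camera
# than the measured real-world surface. "scene_depth.npy", "virtual_depth.npy" and
# "virtual_rgba.npy" are hypothetical inputs sharing the same resolution and units.
scene_depth = np.load("scene_depth.npy")      # H x W depth of the real scene
virtual_depth = np.load("virtual_depth.npy")  # H x W depth of the rendered virtual layer
virtual_rgba = np.load("virtual_rgba.npy")    # H x W x 4 rendered virtual layer (float, 0-1)

visible = virtual_depth < scene_depth                  # True where the virtual object is in front
virtual_rgba[..., 3] = virtual_rgba[..., 3] * visible  # zero the alpha of occluded pixels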
While dedicated depth models exist, developers often use 2D object detection alongside camera calibration data to approximate distance. The ultralytics library simplifies this via its solutions module, allowing users to estimate the distance of tracked objects based on their bounding box positions.
The following code demonstrates how to use YOLO11 to track objects and calculate their approximate distance from the camera.
import cv2

from ultralytics import YOLO, solutions

# Load the YOLO11 model for object detection
model = YOLO("yolo11n.pt")

# Initialize the DistanceCalculation solution
# This estimates distance based on bounding box centroids
dist_obj = solutions.DistanceCalculation(names=model.names, view_img=True)

# Open a video file or camera stream
cap = cv2.VideoCapture("path/to/video.mp4")

while cap.isOpened():
    success, im0 = cap.read()
    if not success:
        break

    # Track objects and calculate distance
    tracks = model.track(im0, persist=True, show=False)
    im0 = dist_obj.start_process(im0, tracks)

    # Display result (or save/process further)
    cv2.imshow("Distance Estimation", im0)
    if cv2.waitKey(1) == ord("q"):
        break

cap.release()
cv2.destroyAllWindows()
It is also important to distinguish depth estimation from similar-sounding terms in the AI ecosystem, since related tasks answer different questions about a scene than per-pixel depth prediction does.
Recent advancements in Generative AI and foundation models are further bridging the gap between 2D and 3D. Techniques like Neural Radiance Fields (NeRF) reconstruct complex 3D scenes from sparse 2D images, relying heavily on underlying depth principles. As model optimization improves, highly accurate depth estimation is becoming feasible on edge devices, powering the next generation of smart drones, service robots, and spatial computing devices.