
Depth Estimation

Discover how depth estimation creates depth maps from images—stereo, ToF, LiDAR, and monocular deep learning—to power robotics, AR/VR and 3D perception.

Depth estimation is a fundamental task in computer vision (CV) that involves determining the distance of objects in a scene relative to the camera. By calculating the depth value for each pixel in an image, this process transforms standard two-dimensional data into a rich 3D representation, often referred to as a depth map. This capability is essential for machines to perceive spatial relationships, enabling them to navigate environments, manipulate objects, and understand the geometry of the world much like the human visual system does.

Mechanisms of Depth Estimation

Estimating depth can be achieved through various methods, ranging from hardware-intensive active sensing to software-driven deep learning (DL) approaches.

  • Stereo Vision: Inspired by human binocular vision, stereo vision systems employ two cameras positioned slightly apart. By analyzing the disparity—the difference in horizontal position of an object between the left and right images—algorithms can triangulate the distance. This method relies heavily on reliable feature matching between the two views; a disparity-to-depth sketch follows this list.
  • Monocular Depth Estimation: This technique estimates depth from a single 2D image, a challenging task because one image carries no explicit depth information. Modern Convolutional Neural Networks (CNNs) are trained on massive datasets to recognize monocular cues, such as object size, perspective, and occlusion. Research into monocular depth prediction has advanced significantly, allowing standard cameras to infer 3D structure; a pretrained-model sketch also follows this list.
  • Active Sensors (LiDAR and ToF): Unlike passive camera systems, active sensors emit signals to measure distance. LiDAR (Light Detection and Ranging) uses laser pulses to create precise 3D point clouds, while Time-of-Flight (ToF) cameras measure the time it takes for light to return to the sensor. These technologies provide high-accuracy ground truth data often used to train machine learning (ML) models.
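
To make the stereo geometry concrete, the sketch below computes a dense disparity map with OpenCV's block matcher and converts it to metric depth using depth = focal_length * baseline / disparity. This is a minimal illustration, not production code: the file names are placeholders, the images are assumed to be a rectified stereo pair, and the focal length and baseline values are invented for the example.

import cv2
import numpy as np

# Load a rectified stereo pair (placeholder file names)
left = cv2.imread("left.png", cv2.IMREAD_GRAYSCALE)
right = cv2.imread("right.png", cv2.IMREAD_GRAYSCALE)

# Block matching yields a disparity value (in pixels) for each pixel
stereo = cv2.StereoBM_create(numDisparities=64, blockSize=15)
disparity = stereo.compute(left, right).astype(np.float32) / 16.0  # fixed-point to pixels

# Triangulate: depth = focal_length * baseline / disparity
focal_length_px = 700.0  # assumed focal length in pixels
baseline_m = 0.12  # assumed camera separation in meters

depth_m = np.zeros_like(disparity)
valid = disparity > 0  # non-positive disparity means no reliable match
depth_m[valid] = focal_length_px * baseline_m / disparity[valid]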
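
For monocular estimation, pretrained networks can be loaded directly from PyTorch Hub. The sketch below runs the small MiDaS model, a widely used monocular depth network, on a single RGB image; the model and transform names follow the published MiDaS hub entry, and the image path is a placeholder. MiDaS predicts relative inverse depth (larger values mean closer), not metric distance.

import cv2
import torch

# Load the small MiDaS monocular depth model and its matching input transforms
midas = torch.hub.load("intel-isl/MiDaS", "MiDaS_small")
midas.eval()
midas_transforms = torch.hub.load("intel-isl/MiDaS", "transforms")
transform = midas_transforms.small_transform

# Read an image (placeholder path) and convert BGR to RGB
img = cv2.cvtColor(cv2.imread("image.jpg"), cv2.COLOR_BGR2RGB)

with torch.no_grad():
    prediction = midas(transform(img))  # relative inverse depth, shape (1, H', W')
    # Upsample the prediction back to the original image resolution
    depth = torch.nn.functional.interpolate(
        prediction.unsqueeze(1), size=img.shape[:2], mode="bicubic", align_corners=False
    ).squeeze()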

Real-World Applications

The ability to perceive the third dimension unlocks critical functionality across various industries.

Autonomous Systems and Robotics

In the field of autonomous vehicles, depth estimation is vital for safety and navigation. Self-driving cars combine camera data with LiDAR to detect obstacles, estimate the distance to other vehicles, and construct a real-time map of the road. Similarly, in robotics, depth perception allows automated arms to perform "pick and place" operations by accurately judging the position and shape of items in manufacturing automation workflows.

Augmented Reality (AR)

For augmented reality experiences to be immersive, virtual objects must interact realistically with the physical world. Depth estimation enables mobile devices to understand the geometry of a room, allowing virtual furniture or characters to be placed on the floor or hidden behind real-world objects (occlusion), vastly improving the user experience.

Python Example: Distance Approximation with YOLO11

While dedicated depth models exist, developers often use 2D object detection alongside calibration data to approximate distance. The ultralytics library simplifies this via its solutions module, letting users estimate the distance between tracked objects based on their bounding box centroids.

The following code demonstrates how to use YOLO11 to track objects and estimate the distance between two tracked objects selected in the display window. Note that this is a pixel-based heuristic rather than true metric depth.

import cv2
from ultralytics import YOLO, solutions

# Load the YOLO11 model for object detection
model = YOLO("yolo11n.pt")

# Initialize the DistanceCalculation solution
# Distance is estimated between two tracked objects, selected by clicking
# their bounding boxes in the display window, using centroid positions
dist_obj = solutions.DistanceCalculation(names=model.names, view_img=True)

# Open a video file or camera stream
cap = cv2.VideoCapture("path/to/video.mp4")

while cap.isOpened():
    success, im0 = cap.read()
    if not success:
        break

    # Track objects and calculate distance
    tracks = model.track(im0, persist=True, show=False)
    im0 = dist_obj.start_process(im0, tracks)

    # view_img=True already displays the annotated frame; press "q" to quit
    if cv2.waitKey(1) == ord("q"):
        break

cap.release()
cv2.destroyAllWindows()

Depth Estimation vs. Related Concepts

It is important to distinguish depth estimation from similar terms in the AI ecosystem:

  • 3D Object Detection: While depth estimation assigns a distance value to every pixel, 3D object detection focuses on identifying specific objects and placing a 3D bounding box (cuboid) around them. 3D detection tells you "where the car is and its size," whereas depth estimation provides the geometry of the entire scene, including the road and background.
  • Distance Calculation: This usually refers to measuring the linear space between two specific points or from the camera to a distinct object (often using 2D heuristics), as shown in the code example above. Depth estimation is a dense, pixel-wise prediction task that generates a complete topographical map of the view.
  • Optical Flow: This measures the apparent motion of objects between frames. While optical flow can be used to infer depth (structure from motion), its primary output is a motion vector field, not a static distance map; see the sketch after this list.
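
To make that distinction concrete, the sketch below computes dense Farneback optical flow with OpenCV on two consecutive frames. The video path is a placeholder and the parameter values are the common ones from the OpenCV documentation; the point is the output shape: one 2D motion vector per pixel, where a depth map would store one scalar distance per pixel.

import cv2

cap = cv2.VideoCapture("path/to/video.mp4")
ok1, frame1 = cap.read()
ok2, frame2 = cap.read()
cap.release()

prev_gray = cv2.cvtColor(frame1, cv2.COLOR_BGR2GRAY)
next_gray = cv2.cvtColor(frame2, cv2.COLOR_BGR2GRAY)

# Dense optical flow: a (dx, dy) motion vector for every pixel
flow = cv2.calcOpticalFlowFarneback(prev_gray, next_gray, None, 0.5, 3, 15, 3, 5, 1.2, 0)
print(flow.shape)  # (H, W, 2) -- a motion field, not a distance map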

Advancing Spatial AI

Recent advancements in Generative AI and foundation models are further bridging the gap between 2D and 3D. Techniques like Neural Radiance Fields (NeRF) use sparse 2D images to reconstruct complex 3D scenes, relying heavily on underlying depth principles. As model optimization improves, highly accurate depth estimation is becoming feasible on edge devices, powering the next generation of smart drones, service robots, and spatial computing devices.
