Synthetic Data
Unlock the power of synthetic data for AI/ML! Overcome data scarcity, privacy issues, and costs while boosting model training and innovation.
Synthetic data refers to artificially generated information that mimics the statistical properties and patterns of real-world data. In the fields of machine learning (ML) and computer vision (CV), it serves as a powerful resource for developing high-performance models when obtaining authentic data is difficult, expensive, or restricted by privacy concerns. Unlike traditional datasets collected from real-world events, synthetic data is programmed or simulated, allowing developers to create vast repositories of perfectly labeled training data on demand. Industry analysts at Gartner predict that by 2030, synthetic data will overshadow real data in AI models, driving a major shift in how intelligent systems are built.
How Synthetic Data Is Generated
Creating high-quality synthetic datasets involves sophisticated techniques that range from classic computer graphics to modern generative AI. These methods ensure that the artificial data is diverse enough to help models generalize well to new, unseen scenarios.
- 3D Simulation and Rendering: Game engines like Unity and Unreal Engine allow developers to build photorealistic virtual environments, where rendering and physics engines simulate light, gravity, and object interactions to produce images that look authentic. This approach is often used in conjunction with 3D object detection workflows.
- Generative Models: Advanced algorithms such as Generative Adversarial Networks (GANs) and diffusion models learn the underlying structure of a small real-world dataset to generate a virtually unlimited number of new variations. Tools like Stable Diffusion exemplify how these models can create complex visual data from scratch.
- Domain Randomization: To prevent overfitting to a specific simulated look, developers use domain randomization. This technique randomly varies parameters such as lighting, textures, and camera angles, forcing the model to learn the essential features of an object rather than the background noise. A minimal procedural sketch of this idea appears after this list.
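The snippet below is a minimal, self-contained sketch of procedural image generation with domain randomization, using only NumPy and OpenCV. It draws a single rectangular "object" onto each frame while randomizing the background color, object color, position, size, and overall brightness, and writes a YOLO-format label for each image. The output directory layout and the class index 0 are illustrative assumptions, not a required convention.

import random
from pathlib import Path

import cv2
import numpy as np

OUT_DIR = Path("synthetic_dataset")  # illustrative output location
(OUT_DIR / "images").mkdir(parents=True, exist_ok=True)
(OUT_DIR / "labels").mkdir(parents=True, exist_ok=True)

IMG_SIZE = 640

for i in range(10):
    # Randomize the background color (domain randomization)
    img = np.full((IMG_SIZE, IMG_SIZE, 3), random.randint(0, 255), dtype=np.uint8)

    # Randomize object position, size, and color
    w, h = random.randint(50, 300), random.randint(50, 300)
    x1 = random.randint(0, IMG_SIZE - w)
    y1 = random.randint(0, IMG_SIZE - h)
    color = tuple(random.randint(0, 255) for _ in range(3))
    cv2.rectangle(img, (x1, y1), (x1 + w, y1 + h), color, -1)

    # Randomize global lighting by rescaling brightness and contrast
    img = cv2.convertScaleAbs(img, alpha=random.uniform(0.5, 1.5), beta=random.randint(-30, 30))

    cv2.imwrite(str(OUT_DIR / "images" / f"synthetic_{i}.jpg"), img)

    # YOLO-format label: class x_center y_center width height (normalized)
    xc, yc = (x1 + w / 2) / IMG_SIZE, (y1 + h / 2) / IMG_SIZE
    label = f"0 {xc:.6f} {yc:.6f} {w / IMG_SIZE:.6f} {h / IMG_SIZE:.6f}\n"
    (OUT_DIR / "labels" / f"synthetic_{i}.txt").write_text(label)

Because the object's position and size are known at generation time, the bounding-box label comes for free, which is one of the main practical advantages of synthetic data pipelines.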
Real-World Applications
Synthetic data is revolutionizing industries where data collection is a bottleneck.
- Autonomous Vehicles: Training self-driving cars requires exposing them to millions of driving scenarios, including rare and dangerous events like pedestrians darting into traffic or severe weather conditions. Collecting this data physically is unsafe. Companies like Waymo utilize simulation to test their autonomous vehicles across billions of virtual miles, refining their object detection systems without risking lives.
- Healthcare and Medical Imaging: Patient records are protected by strict regulations such as HIPAA, so sharing real X-rays or MRI scans for research is often legally complex. Synthetic data allows researchers to generate realistic medical image analysis datasets that retain the statistical markers of diseases without containing any personally identifiable information (PII). This preserves data privacy while advancing diagnostic tools.
Synthetic Data vs. Data Augmentation
It is important to distinguish synthetic data from data augmentation, as both are used to enhance datasets.
- Data Augmentation takes existing real-world images and modifies them, for example by flipping, rotating, or changing color balance, to increase variety. You can read more about this in the YOLO data augmentation guide.
- Synthetic Data is created from scratch. It does not rely on modifying a specific source image but generates entirely new instances, allowing for the creation of scenarios that may have never been captured by a camera. The sketch after this list contrasts the two approaches.
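As a rough illustration of the difference, the sketch below augments an existing image by flipping and rotating it, then separately synthesizes a brand-new image from scratch. It uses only basic NumPy and OpenCV operations rather than any specific augmentation library, and the randomly generated "real" image is just a stand-in for data loaded from disk.

import cv2
import numpy as np

# --- Data augmentation: start from an existing image and modify it ---
# A stand-in "real" image (in practice this would be loaded from disk)
real_img = np.random.randint(0, 255, (640, 640, 3), dtype=np.uint8)

flipped = cv2.flip(real_img, 1)  # horizontal flip
rotated = cv2.rotate(real_img, cv2.ROTATE_90_CLOCKWISE)  # 90-degree rotation

# --- Synthetic data: create an entirely new image from scratch ---
synthetic = np.zeros((640, 640, 3), dtype=np.uint8)
cv2.circle(synthetic, (320, 320), 120, (0, 255, 0), -1)  # a procedurally drawn object

print(flipped.shape, rotated.shape, synthetic.shape)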
Integration with Ultralytics YOLO
Synthetic datasets are formatted just like real datasets, usually with images and corresponding annotation files. You can seamlessly train state-of-the-art models like YOLO11 on this data to boost performance in niche tasks.
The following example demonstrates how to generate a simple synthetic image using code and run inference on it using the ultralytics package.
import cv2
import numpy as np

from ultralytics import YOLO

# 1. Generate a synthetic image (black background, white rectangle)
# This mimics a simple object generation process
synthetic_img = np.zeros((640, 640, 3), dtype=np.uint8)
cv2.rectangle(synthetic_img, (100, 100), (400, 400), (255, 255, 255), -1)

# 2. Load a pretrained YOLO11 model
model = YOLO("yolo11n.pt")

# 3. Run inference on the synthetic data
# The model attempts to detect objects within the generated image
results = model.predict(synthetic_img)

# Display result count
print(f"Detected {len(results[0].boxes)} objects in synthetic image.")