Unlock the power of synthetic data for AI/ML! Overcome data scarcity, privacy concerns, and costs while accelerating model training and innovation.
Synthetic data is artificially generated information that mimics the statistical properties, patterns, and structural characteristics of real-world data. In the rapidly evolving fields of artificial intelligence (AI) and machine learning (ML), this data serves as a critical resource when collecting authentic data is expensive, time-consuming, or restricted by privacy regulations. Unlike organic data harvested from real-world events, synthetic data is algorithmically created using techniques such as computer simulations and advanced generative models. Industry analysts at Gartner predict that by 2030, synthetic data will overshadow real data in AI models, fundamentally shifting how intelligent systems are built and deployed.
The primary driver for utilizing synthetic datasets is to overcome the limitations inherent in traditional data collection and annotation. Training robust computer vision (CV) models often requires massive datasets containing diverse scenarios. When real-world data is scarce—such as in rare disease diagnosis or dangerous edge-case traffic accidents—synthetic data bridges the gap.
Generating this data allows developers to create perfectly labeled training data on demand. This includes precise bounding boxes for object detection or pixel-perfect masks for semantic segmentation, eliminating the human error often found in manual labeling processes. Furthermore, it addresses bias in AI by allowing engineers to deliberately balance datasets with underrepresented groups or environmental conditions, ensuring fairer model performance.
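Because the generator controls exactly what appears in each sample, labels can be derived from the generation parameters rather than drawn by hand. A minimal sketch of this idea, assuming nothing beyond NumPy (the function name `render_object` and the rectangle-as-object setup are illustrative, not from any specific library):

```python
import numpy as np


def render_object(height=224, width=224, top=50, left=80, h=60, w=40):
    """Render one synthetic 'object' (a bright rectangle) on a blank image.

    Returns the image plus labels that are exact by construction:
    a pixel-perfect segmentation mask and a precise bounding box.
    """
    image = np.zeros((height, width), dtype=np.uint8)
    image[top:top + h, left:left + w] = 255   # the synthetic object
    mask = image > 0                          # pixel-perfect segmentation mask
    bbox = (left, top, left + w, top + h)     # exact (x1, y1, x2, y2) box
    return image, bbox, mask


image, bbox, mask = render_object()
print(bbox)             # (80, 50, 120, 110)
print(int(mask.sum()))  # 2400 labeled pixels, with zero manual annotation
```

Real pipelines replace the rectangle with a rendered 3D scene, but the principle is the same: the labels fall out of the generation process, so they carry no human annotation error.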
Synthetic data is revolutionizing industries where data privacy, safety, and scalability are paramount.
Creating high-quality synthetic data often involves two main approaches: simulation engines and generative AI. Simulation engines, like the Unity Engine, use 3D graphics to render scenes with physics-based lighting and textures. Alternatively, generative models, such as Generative Adversarial Networks (GANs) and diffusion models, learn the distribution of real data to synthesize new, photorealistic examples.
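The core idea behind the generative approach can be sketched with a deliberately simple stand-in: estimate the distribution of real data, then sample new records from it. Here a multivariate Gaussian plays the role that a GAN or diffusion model would in practice, and the "real" dataset is itself simulated for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a small real dataset of 2-D measurements.
real = rng.multivariate_normal(
    mean=[5.0, -2.0], cov=[[2.0, 0.8], [0.8, 1.0]], size=500
)

# "Train": estimate the distribution's parameters from the real data.
mu = real.mean(axis=0)
sigma = np.cov(real, rowvar=False)

# "Generate": sample as many synthetic records as needed.
synthetic = rng.multivariate_normal(mu, sigma, size=1000)

# The synthetic set mirrors the real data's statistical properties.
print(synthetic.shape)  # (1000, 2)
```

GANs and diffusion models do the same job for far richer distributions (photorealistic images rather than 2-D points), but the contract is identical: learn the statistics of real data, then sample from them at will.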
Once a synthetic dataset is generated, it can be used to train high-performance models. The following Python example
demonstrates how to load a model—potentially trained on synthetic data—using the ultralytics package to
perform inference on an image.
from ultralytics import YOLO
# Load the YOLO26 model (latest stable generation for superior accuracy)
model = YOLO("yolo26n.pt")
# Run inference on a source image (this could be a synthetic validation image)
results = model("https://ultralytics.com/images/bus.jpg")
# Display the detection results to verify model performance
results[0].show()
It is helpful to distinguish synthetic data from data augmentation, as both techniques aim to expand datasets but function differently.
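Data augmentation, by contrast, transforms existing samples rather than creating new ones from scratch. A minimal sketch with NumPy, using a tiny array as a stand-in for a real photo:

```python
import numpy as np

image = np.arange(12, dtype=np.uint8).reshape(3, 4)  # stand-in for a real image

# Augmentation re-presents the same sample in new orientations.
augmented = [
    np.fliplr(image),  # horizontal flip
    np.flipud(image),  # vertical flip
    np.rot90(image),   # 90-degree rotation
]

print(len(augmented))  # 3 augmented views of one original sample
```

Every augmented view still depicts the original scene; synthetic data, on the other hand, introduces scenes that never existed in the source dataset at all.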
Modern workflows on the Ultralytics Platform often combine both approaches: using synthetic data to fill gaps in the dataset and applying data augmentation during training to maximize the robustness of models like YOLO26.