
Synthetic Data

Unlock the potential of synthetic data for AI/ML! Overcome data scarcity, privacy concerns, and cost while advancing model training and innovation.

Synthetic data is artificially generated information that mimics the statistical properties, patterns, and structural characteristics of real-world data. In artificial intelligence (AI) and machine learning (ML), it serves as a critical resource when collecting authentic data is expensive, time-consuming, or restricted by privacy regulations. Unlike organic data harvested from real-world events, synthetic data is created algorithmically using techniques such as computer simulations and generative models. Industry analysts at Gartner predict that by 2030 synthetic data will overshadow real data in AI models, fundamentally shifting how intelligent systems are built and deployed.

The Role of Synthetic Data in AI Development

The primary reason to use synthetic datasets is to overcome the limitations of traditional data collection and annotation. Training robust computer vision (CV) models often requires massive datasets covering diverse scenarios. When real-world data is scarce, as in rare disease diagnosis or dangerous edge cases such as traffic accidents, synthetic data bridges the gap.

Generating this data allows developers to create exactly labeled training data on demand. Because the labels are derived from the generation process itself, this includes precise bounding boxes for object detection or pixel-perfect masks for semantic segmentation, avoiding the human error often found in manual labeling. It can also help address bias in AI by allowing engineers to deliberately balance datasets with underrepresented groups or environmental conditions, which supports fairer model performance.
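As a minimal sketch of why generated data arrives with exact labels, the following pure-Python example draws one rectangle on a blank canvas and derives its bounding box directly from the drawing parameters. The function name `make_synthetic_sample`, the canvas size, and the YOLO-style normalized label layout are illustrative assumptions, not part of any particular tool.

```python
import random


def make_synthetic_sample(width=64, height=48, seed=0):
    """Render one filled rectangle on a blank grayscale canvas and return
    the image together with its exact YOLO-style label (hypothetical helper)."""
    rng = random.Random(seed)

    # Blank canvas: a list of rows, each pixel an intensity in 0..255.
    image = [[0] * width for _ in range(height)]

    # Pick the box size and top-left corner so the box stays inside the canvas.
    box_w = rng.randint(8, width // 2)
    box_h = rng.randint(8, height // 2)
    x0 = rng.randint(0, width - box_w)
    y0 = rng.randint(0, height - box_h)

    # Paint the object.
    for y in range(y0, y0 + box_h):
        for x in range(x0, x0 + box_w):
            image[y][x] = 255

    # The label comes from the same numbers used for drawing, so it cannot
    # drift from the pixels: class id, then normalized centre and size.
    label = (
        0,
        (x0 + box_w / 2) / width,
        (y0 + box_h / 2) / height,
        box_w / width,
        box_h / height,
    )
    return image, label


image, label = make_synthetic_sample(seed=1)
print(label)
```

A real pipeline would replace the rectangle with a rendered 3D scene or a generated image, but the principle is the same: the annotation is a by-product of generation rather than a separate manual step.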

Real-World Applications

Synthetic data is revolutionizing industries where data privacy, safety, and scalability are paramount.

  • Autonomous Driving Simulations: Testing autonomous vehicles solely in the physical world is risky and geographically limited. Companies utilize photorealistic simulators, such as NVIDIA Omniverse, to train their perception systems. These simulators generate billions of virtual miles, exposing the AI to hazardous weather, erratic pedestrian behavior, and complex urban layouts that are difficult to capture consistently in the real world.
  • Healthcare and Medical Imaging: Privacy regulations such as HIPAA and GDPR strictly govern the sharing of medical records. Synthetic data enables the creation of realistic medical image analysis datasets (such as X-rays or MRI scans) that retain the markers of pathology without containing any personally identifiable information. This allows researchers to train tumor detection models collaboratively without compromising patient confidentiality.

Generating Synthetic Data for Vision AI

Creating high-quality synthetic data often involves two main approaches: simulation engines and generative AI. Simulation engines, like the Unity Engine, use 3D graphics to render scenes with physics-based lighting and textures. Alternatively, generative models, such as Generative Adversarial Networks (GANs) and diffusion models, learn the distribution of real data to synthesize new, photorealistic examples.
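The idea of learning a distribution and then sampling from it can be illustrated with a deliberately tiny stand-in. The sketch below fits a single Gaussian to a handful of numbers and draws new values from it. This is not a GAN or a diffusion model, and the helper names and sample measurements are hypothetical, but the fit-then-sample structure is the same.

```python
import random
import statistics


def fit_gaussian(real_values):
    """'Learn' the real data's distribution as a mean and standard deviation."""
    return statistics.mean(real_values), statistics.stdev(real_values)


def sample_synthetic(mean, stdev, count, seed=0):
    """Draw fresh values from the fitted distribution instead of copying real ones."""
    rng = random.Random(seed)
    return [rng.gauss(mean, stdev) for _ in range(count)]


# Hypothetical real measurements (for example, object widths in metres).
real = [4.9, 5.1, 5.0, 5.3, 4.7]
mean, stdev = fit_gaussian(real)
synthetic = sample_synthetic(mean, stdev, count=1000)

print(f"real mean={mean:.2f}, synthetic mean={statistics.mean(synthetic):.2f}")
```

Generative image models apply the same principle to a far higher-dimensional distribution, which is why they need much more real data and compute to fit.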

Once a synthetic dataset is generated, it can be used to train high-performance models. The following Python example loads a model (potentially trained on synthetic data) with the ultralytics package and runs inference on an image.

from ultralytics import YOLO

# Load the pretrained YOLO26 nano model
model = YOLO("yolo26n.pt")

# Run inference on a source image (this could be a synthetic validation image)
results = model("https://ultralytics.com/images/bus.jpg")

# Display the detection results to verify model performance
results[0].show()

Synthetic Data vs. Data Augmentation

It is helpful to distinguish synthetic data from data augmentation, as both techniques aim to expand datasets but function differently.

  • Data Augmentation applies transformations (such as flipping, rotation, cropping, or color adjustment) to existing real-world images to create slight variations. It relies on the original data source.
  • Synthetic Data involves the creation of entirely new data instances from scratch using algorithms or simulations. It does not strictly require an original image for every output, allowing for the generation of scenarios that have never been captured by a camera.
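The difference between the two can be sketched in a few lines of plain Python. The function names below are hypothetical, and the gradient generator is only a stand-in for a real renderer or generative model. The point is what each function takes as input.

```python
import random


def augment_flip(image):
    """Data augmentation: derive a variant by mirroring an existing image."""
    return [row[::-1] for row in image]


def synthesize_gradient(width, height, seed=0):
    """Synthetic data: build an image from parameters alone, with no source image."""
    rng = random.Random(seed)
    offset = rng.randint(0, 55)  # random brightness offset so seeds differ
    return [
        [offset + (200 * x) // max(width - 1, 1) for x in range(width)]
        for _ in range(height)
    ]


existing = [[0, 50, 100], [150, 200, 250]]  # stand-in for a real photo
flipped = augment_flip(existing)            # requires `existing` as input
generated = synthesize_gradient(4, 2)       # requires only parameters

print(flipped)
print(generated)
```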

Modern workflows on the Ultralytics Platform often combine both approaches: using synthetic data to fill gaps in the dataset and applying data augmentation during training to maximize the robustness of models like YOLO26.
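One way to decide where synthetic data should fill gaps is to count the real samples per class and compute the shortfall against a target. The sketch below assumes a flat list of class labels. The function name `synthetic_quota` and the example weather classes are hypothetical and not part of the Ultralytics API.

```python
from collections import Counter


def synthetic_quota(labels, target=None):
    """Return how many synthetic samples each class needs to reach `target`.

    If `target` is omitted, the size of the largest class is used.
    Assumes `labels` is non-empty.
    """
    counts = Counter(labels)
    if target is None:
        target = max(counts.values())
    return {cls: max(target - n, 0) for cls, n in counts.items()}


# Hypothetical labels from a real dataset in which fog is rare.
real_labels = ["clear"] * 900 + ["rain"] * 250 + ["fog"] * 40
print(synthetic_quota(real_labels))
# {'clear': 0, 'rain': 650, 'fog': 860}
```

The resulting quota can then drive a simulator or generative model, with augmentation applied to the combined dataset during training.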
