Discover how knowledge distillation compresses AI models for faster inference, improved accuracy, and efficient deployment on edge devices.
Knowledge distillation is a sophisticated technique in machine learning where a compact neural network, referred to as the "student," is trained to reproduce the behavior and performance of a larger, more complex network, known as the "teacher." The primary objective of this process is model optimization, allowing developers to transfer the predictive capabilities of heavy architectures into lightweight models suitable for deployment on resource-constrained hardware. By capturing the rich information encoded in the teacher's predictions, the student model often achieves significantly higher accuracy than it would if trained on the ground-truth labels alone, effectively bridging the gap between high performance and efficiency.
In traditional supervised learning, models are trained using "hard labels" from the training data, where an image is definitively categorized (e.g., 100% "dog" and 0% "cat"). However, a pre-trained teacher model produces an output via a softmax function that assigns probabilities to all classes. These probability distributions are known as "soft labels" or "dark knowledge."
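As a minimal sketch, using hypothetical logit values, a temperature-scaled softmax turns a teacher's raw outputs into the soft probability distribution used as a training target:

import torch
import torch.nn.functional as F

# Hypothetical raw teacher logits for the classes [wolf, dog, cat]
teacher_logits = torch.tensor([5.0, 2.7, 0.5])

# Standard softmax: a sharp, near-one-hot distribution
sharp_probs = F.softmax(teacher_logits, dim=0)

# Temperature-scaled softmax (T > 1): a softer distribution that exposes
# the relative similarity between classes ("dark knowledge")
T = 4.0
soft_labels = F.softmax(teacher_logits / T, dim=0)

print(sharp_probs)  # heavily concentrated on "wolf"
print(soft_labels)  # probability mass also spread across "dog" and "cat"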
For instance, if a teacher model analyzes an image of a wolf, it might predict 90% wolf, 9% dog, and 1% cat. This distribution reveals that the wolf shares visual features with a dog, context that a hard label ignores. During the distillation process, the student minimizes a loss function, such as the Kullback-Leibler divergence, to align its predictions with the teacher's soft labels. This method, popularized by Geoffrey Hinton's research, helps the student generalize better and reduces overfitting on smaller datasets.
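A common formulation of this objective, shown here as an illustrative sketch rather than any library's built-in API, blends standard cross-entropy on the hard labels with a temperature-scaled Kullback-Leibler divergence against the teacher's soft labels:

import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, hard_labels, T=4.0, alpha=0.5):
    # Illustrative distillation loss (not a library built-in).
    # Hard-label term: standard cross-entropy against the ground truth.
    ce = F.cross_entropy(student_logits, hard_labels)

    # Soft-label term: KL divergence between temperature-softened distributions.
    # The T*T factor keeps gradient magnitudes comparable across temperatures.
    kd = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)

    # alpha balances imitating the teacher against fitting the hard labels.
    return alpha * kd + (1 - alpha) * ce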
Knowledge distillation is pivotal in industries where computational resources are scarce but high performance is non-negotiable.
It is important to differentiate knowledge distillation from other compression strategies, as they modify models in fundamentally different ways.
In a practical workflow, you first select a lightweight architecture to serve as the student. The Ultralytics Platform can be used to manage datasets and track the training experiments for these efficient models. Below is an example of initializing a compact YOLO26 model, which is well suited both to edge deployment and to serving as a student network:
from ultralytics import YOLO
# Load a lightweight YOLO26 nano model (acts as the student)
# The 'n' suffix denotes the nano version, optimized for speed
student_model = YOLO("yolo26n.pt")
# Train the model on a dataset
# In a custom distillation loop, the loss would be influenced by a teacher model
results = student_model.train(data="coco8.yaml", epochs=5, imgsz=640)
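Once training finishes, the compact student can be packaged for edge runtimes. For example, exporting to ONNX (one of several formats supported by the Ultralytics export API):

# Export the trained student to ONNX for deployment on edge hardware
student_model.export(format="onnx")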