
Model Quantization

Optimize AI performance with model quantization. Reduce model size, increase speed, and improve energy efficiency for real-world deployments.

Model quantization is a sophisticated model optimization technique used to reduce the computational and memory costs of running deep learning models. In standard training workflows, neural networks typically store parameters (weights and biases) and activation maps using 32-bit floating-point numbers (FP32). While this high precision ensures accurate calculations during training, it is often unnecessary for inference. Quantization converts these values into lower-precision formats, such as 16-bit floating-point (FP16) or 8-bit integers (INT8), effectively shrinking the model size and accelerating execution speed without significantly compromising accuracy.
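
To make the mapping concrete, the following sketch (a minimal, standalone NumPy illustration, not part of the Ultralytics API) shows how an FP32 tensor can be quantized to 8-bit integers using a scale and zero point, and then dequantized back with only a small rounding error.

import numpy as np

# Hypothetical FP32 weights to be quantized
weights_fp32 = np.array([-0.82, -0.10, 0.05, 0.47, 1.30], dtype=np.float32)

# Asymmetric (affine) quantization to unsigned 8-bit integers
qmin, qmax = 0, 255
scale = float(weights_fp32.max() - weights_fp32.min()) / (qmax - qmin)
zero_point = int(round(qmin - float(weights_fp32.min()) / scale))

# Quantize: map real values onto the integer grid
weights_int8 = np.clip(np.round(weights_fp32 / scale) + zero_point, qmin, qmax).astype(np.uint8)

# Dequantize: recover approximate real values for comparison
weights_recovered = (weights_int8.astype(np.float32) - zero_point) * scale

print(weights_int8)       # compact integer representation stored in memory
print(weights_recovered)  # close to the original values, small rounding error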

Why Quantization Matters

The primary driver for quantization is the need to deploy powerful AI on resource-constrained hardware. As computer vision models like YOLO26 become more complex, their computational demands increase. Quantization addresses three critical bottlenecks:

  • Memory Footprint: By reducing the bit-width of weights (e.g., from 32-bit to 8-bit), the model's storage requirement is reduced by up to 4x (a rough size comparison follows this list). This is vital for mobile apps where application size is restricted.
  • Inference Latency: Lower precision operations are computationally cheaper. Modern processors, especially those with specialized neural processing units (NPUs), can execute INT8 operations much faster than FP32, significantly reducing inference latency.
  • Power Consumption: Moving less data through memory and performing simpler arithmetic operations consumes less energy, extending battery life in portable devices and autonomous vehicles.
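
As a rough illustration of the memory-footprint point above, the short sketch below compares the storage needed for the same set of weights at different bit-widths; the parameter count is hypothetical and the calculation ignores per-tensor metadata such as scales and zero points.

# Hypothetical parameter count, roughly the size of a small detection model
num_params = 3_200_000

bytes_per_value = {"FP32": 4, "FP16": 2, "INT8": 1}

for dtype, nbytes in bytes_per_value.items():
    size_mb = num_params * nbytes / 1e6
    print(f"{dtype}: {size_mb:.1f} MB")

# FP32 -> INT8 shrinks weight storage by 4x (12.8 MB -> 3.2 MB here)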

Comparison With Related Concepts

It is important to distinguish quantization from other optimization techniques, since each modifies the model in a distinct way:

  • Quantization vs. Pruning: While quantization reduces the file size by lowering the bit-width of parameters, model pruning involves removing unnecessary connections (weights) entirely to create a sparse network. Pruning alters the model's structure, whereas quantization alters the data representation (a minimal contrast of the two is sketched after this list).
  • Quantization vs. Knowledge Distillation: Knowledge distillation is a training technique in which a small "student" model learns to mimic a large "teacher" model. Quantization is often applied to the student model after distillation to further improve edge AI performance.
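
The following minimal PyTorch sketch (an illustration only, not the Ultralytics workflow) contrasts the two ideas: unstructured pruning zeroes out low-magnitude weights to create sparsity, while a simple cast to FP16 stands in for quantization's change of numeric representation.

import torch.nn as nn
import torch.nn.utils.prune as prune

layer = nn.Linear(8, 4)

# Pruning: zero out the 50% smallest-magnitude weights, creating a sparse layer
prune.l1_unstructured(layer, name="weight", amount=0.5)
sparsity = (layer.weight == 0).float().mean().item()
print(f"Zeroed weights after pruning: {sparsity:.0%}")

# Quantization (in spirit): keep every weight but store it at lower precision
weights_fp16 = layer.weight.detach().half()
print(layer.weight.dtype, "->", weights_fp16.dtype)  # torch.float32 -> torch.float16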

Real-World Applications

Quantization enables computer vision and AI across various industries where efficiency is paramount.

  1. Autonomous Systems: In the automotive industry, self-driving cars must process visual data from cameras and LiDAR in real time. Quantized models deployed on NVIDIA TensorRT engines allow these vehicles to detect pedestrians and obstacles with millisecond latency, ensuring passenger safety (an export sketch targeting TensorRT follows this list).
  2. Smart Agriculture: Drones equipped with multispectral cameras use quantized object detection models to identify crop diseases or monitor growth stages. Running these models locally on the drones' embedded systems eliminates the need for unreliable cellular connections in remote fields.
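
To target the TensorRT path mentioned in item 1, an Ultralytics model can be exported to a TensorRT engine with INT8 calibration. The sketch below is illustrative and assumes an NVIDIA GPU with TensorRT installed and an Ultralytics version that supports INT8 calibration for this format.

from ultralytics import YOLO

# Load the model to be optimized
model = YOLO("yolo26n.pt")

# Export to a TensorRT engine with INT8 calibration
# 'data' supplies sample images used to calibrate the quantized ranges
model.export(format="engine", int8=True, data="coco8.yaml")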

Implementing Quantization With Ultralytics

The Ultralytics library simplifies the export process, allowing developers to convert models like the cutting-edge YOLO26 into quantized formats. The Ultralytics Platform also provides tools to manage these deployments seamlessly.

The following example demonstrates how to export a model to TFLite with INT8 quantization enabled. This process involves a calibration step where the model observes sample data to determine the optimal dynamic range for the quantized values.

from ultralytics import YOLO

# Load a standard YOLO26 model
model = YOLO("yolo26n.pt")

# Export to TFLite format with INT8 quantization
# The 'int8' argument triggers Post-Training Quantization
# 'data' provides the calibration dataset needed for mapping values
model.export(format="tflite", int8=True, data="coco8.yaml")
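
Once exported, the quantized file can be loaded straight back into the same API for inference. The short sketch below assumes that export() returns the path of the generated TFLite file and uses a sample image hosted by Ultralytics.

from ultralytics import YOLO

# Re-run the export, this time capturing the path of the generated file
model = YOLO("yolo26n.pt")
tflite_path = model.export(format="tflite", int8=True, data="coco8.yaml")

# Load the quantized model and run a quick prediction
tflite_model = YOLO(tflite_path)
results = tflite_model("https://ultralytics.com/images/bus.jpg")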

Optimized models are often deployed using interoperable standards such as ONNX or high-performance inference engines such as OpenVINO, ensuring broad compatibility across different hardware ecosystems.
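
For example, an OpenVINO export with INT8 quantization follows the same pattern as the TFLite example above; this sketch assumes an Intel-compatible target with the OpenVINO runtime available.

from ultralytics import YOLO

# Load the model and export it to OpenVINO with post-training INT8 quantization
model = YOLO("yolo26n.pt")
model.export(format="openvino", int8=True, data="coco8.yaml")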
