
Model Quantization

Optimize AI performance with model quantization: shrink model size, increase speed, and improve energy efficiency for real-world deployment.

Model quantization is a sophisticated model optimization technique used to reduce the computational and memory costs of running deep learning models. In standard training workflows, neural networks typically store parameters (weights and biases) and activation maps using 32-bit floating-point numbers (FP32). While this high precision ensures accurate calculations during training, it is often unnecessary for inference. Quantization converts these values into lower-precision formats, such as 16-bit floating-point (FP16) or 8-bit integers (INT8), effectively shrinking the model size and accelerating execution speed without significantly compromising accuracy.
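
To see what this conversion actually does to the numbers, the following minimal sketch implements the affine mapping that most INT8 schemes build on, using plain NumPy rather than any specific quantization library; the sample values are illustrative.

import numpy as np

# Example FP32 activations (illustrative values)
x = np.array([-1.8, -0.3, 0.0, 0.7, 2.4], dtype=np.float32)

# Derive an affine mapping x ≈ scale * (q - zero_point) from the observed range
qmin, qmax = -128, 127  # signed INT8 range
scale = (x.max() - x.min()) / (qmax - qmin)
zero_point = int(round(qmin - x.min() / scale))

# Quantize to INT8, then dequantize to inspect the rounding error
q = np.clip(np.round(x / scale) + zero_point, qmin, qmax).astype(np.int8)
x_hat = scale * (q.astype(np.float32) - zero_point)

print(q)      # integers the quantized model stores (1 byte each)
print(x_hat)  # values it effectively computes with (close to x)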

Why Quantization Matters

The primary driver for quantization is the need to deploy powerful AI on resource-constrained hardware. As computer vision models like YOLO26 become more complex, their computational demands increase. Quantization addresses three critical bottlenecks:

  • Memory Footprint: By reducing the bit-width of weights (e.g., from 32-bit to 8-bit), the model's storage requirement is reduced by up to 4x. This is vital for mobile apps where application size is restricted (see the quick calculation after this list).
  • Inference Latency: Lower precision operations are computationally cheaper. Modern processors, especially those with specialized neural processing units (NPUs), can execute INT8 operations much faster than FP32, significantly reducing inference latency.
  • Power Consumption: Moving less data through memory and performing simpler arithmetic operations consumes less energy, extending battery life in portable devices and autonomous vehicles.
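
The first point can be checked with simple arithmetic. The sketch below estimates storage for a hypothetical 3-million-parameter detector; the count is illustrative, not the measured size of any specific model.

# Back-of-the-envelope model size at different precisions
num_params = 3_000_000  # hypothetical parameter count

for name, bits in [("FP32", 32), ("FP16", 16), ("INT8", 8)]:
    size_mb = num_params * bits / 8 / 1e6
    print(f"{name}: {size_mb:.1f} MB")  # 12.0 MB, 6.0 MB, 3.0 MB -> 4x smaller at INT8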

Comparison with Related Concepts

It is essential to distinguish quantization from other optimization techniques, as each modifies the model in a fundamentally different way:

  • Quantization vs. Pruning: While quantization reduces the file size by lowering the bit-width of parameters, model pruning involves removing unnecessary connections (weights) entirely to create a sparse network. Pruning alters the model's structure, whereas quantization alters the data representation, as the short sketch after this list illustrates.
  • Quantization vs. Knowledge Distillation: Knowledge distillation is a training technique in which a small "student" model learns to mimic a larger "teacher" model. Quantization is typically applied to the student model after distillation to further boost Edge AI performance.
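
The pruning/quantization difference is easiest to see on a toy weight vector. The sketch below is illustrative only, not how any framework implements these passes: pruning zeroes out connections, while quantization keeps every connection but stores it at lower precision.

import numpy as np

w = np.array([0.91, -0.02, 0.44, 0.003, -0.67], dtype=np.float32)

# Pruning: zero out weights below a magnitude threshold -> sparser structure
pruned = w * (np.abs(w) >= 0.05)

# Quantization: keep all weights, store them as INT8 -> cheaper representation
scale = np.abs(w).max() / 127  # simple symmetric mapping
quantized = np.clip(np.round(w / scale), -127, 127).astype(np.int8)

print(pruned)     # same precision, fewer effective connections
print(quantized)  # same connections, 4x less storage per weight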

Real-World Applications

Quantization enables computer vision and AI across various industries where efficiency is paramount.

  1. Autonomous Systems: In the automotive industry, self-driving cars must process visual data from cameras and LiDAR in real-time. Quantized models deployed on NVIDIA TensorRT engines allow these vehicles to detect pedestrians and obstacles with millisecond latency, ensuring passenger safety.
  2. Smart Agriculture: Drones equipped with multispectral cameras use quantized object detection models to identify crop diseases or monitor growth stages. These models run locally on the drone's embedded system, avoiding reliance on unreliable cellular connectivity over remote farmland.

Implementing Quantization with Ultralytics

The Ultralytics library simplifies the export process, allowing developers to convert models like the cutting-edge YOLO26 into quantized formats. The Ultralytics Platform also provides tools to manage these deployments seamlessly.

The following example demonstrates how to export a model to TFLite with INT8 quantization enabled. This process involves a calibration step where the model observes sample data to determine the optimal dynamic range for the quantized values.

from ultralytics import YOLO

# Load a standard YOLO26 model
model = YOLO("yolo26n.pt")

# Export to TFLite format with INT8 quantization
# The 'int8' argument triggers Post-Training Quantization
# 'data' provides the calibration dataset needed for mapping values
model.export(format="tflite", int8=True, data="coco8.yaml")
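
After exporting, the quantized model can be loaded back through the same YOLO interface to sanity-check its predictions before deployment. A minimal sketch follows; the TFLite path shown is illustrative, as the exact filename depends on the model and library version (model.export() returns the actual location).

# Load the exported INT8 model back for a quick sanity check
# (illustrative path; export() returns the real one)
quantized_model = YOLO("yolo26n_saved_model/yolo26n_int8.tflite")

# Run inference on a sample image to confirm accuracy survived quantization
results = quantized_model("https://ultralytics.com/images/bus.jpg")
results[0].show()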

The optimized model is typically deployed via interoperability standards such as ONNX or high-performance inference engines such as OpenVINO, ensuring broad compatibility across diverse hardware ecosystems.
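
The same export API covers those targets as well. The snippet below sketches an FP16 ONNX export and an INT8 OpenVINO export; argument support can vary by format and library version, so treat it as a starting point rather than a definitive recipe.

from ultralytics import YOLO

model = YOLO("yolo26n.pt")

# ONNX with FP16 half-precision weights
model.export(format="onnx", half=True)

# OpenVINO with INT8 Post-Training Quantization (needs calibration data)
model.export(format="openvino", int8=True, data="coco8.yaml")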
