Model Quantization

Optimize AI performance with model quantization. Shrink model size, boost speed, and improve energy efficiency for better real-world deployment.

Model quantization is a sophisticated model optimization technique used to reduce the computational and memory costs of running deep learning models. In standard training workflows, neural networks typically store parameters (weights and biases) and activation maps using 32-bit floating-point numbers (FP32). While this high precision ensures accurate calculations during training, it is often unnecessary for inference. Quantization converts these values into lower-precision formats, such as 16-bit floating-point (FP16) or 8-bit integers (INT8), effectively shrinking the model size and accelerating execution speed without significantly compromising accuracy.
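
To make the conversion concrete, here is a minimal, illustrative sketch of symmetric INT8 quantization in NumPy (the exact scheme, including zero-points and per-channel scales, varies by runtime):

import numpy as np

# Toy FP32 weights
weights = np.array([0.42, -1.37, 0.08, 2.91, -0.55], dtype=np.float32)

# Symmetric mapping: scale [-max|w|, +max|w|] onto the INT8 range [-127, 127]
scale = np.abs(weights).max() / 127.0
q_weights = np.round(weights / scale).astype(np.int8)

# Dequantize to inspect the rounding error the conversion introduces
dequantized = q_weights.astype(np.float32) * scale
print(q_weights)                      # INT8 values, 1 byte each
print(np.abs(weights - dequantized))  # small per-weight quantization error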

Why Quantization Matters

The primary driver for quantization is the need to deploy powerful AI on resource-constrained hardware. As computer vision models like YOLO26 become more complex, their computational demands increase. Quantization addresses three critical bottlenecks:

  • Memory Footprint: By reducing the bit-width of weights (e.g., from 32-bit to 8-bit), the model's storage requirement is reduced by up to 4x (see the arithmetic sketch after this list). This is vital for mobile apps where application size is restricted.
  • Inference Latency: Lower precision operations are computationally cheaper. Modern processors, especially those with specialized neural processing units (NPUs), can execute INT8 operations much faster than FP32, significantly reducing inference latency.
  • Power Consumption: Moving less data through memory and performing simpler arithmetic operations consumes less energy, extending battery life in portable devices and autonomous vehicles.
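
The 4x figure above is simple arithmetic; a back-of-envelope check, using an illustrative parameter count:

# Storage estimate for a hypothetical 25M-parameter model
num_params = 25_000_000
fp32_mb = num_params * 4 / 1e6  # 32 bits = 4 bytes per weight -> 100.0 MB
int8_mb = num_params * 1 / 1e6  # 8 bits  = 1 byte per weight  -> 25.0 MB
print(fp32_mb / int8_mb)  # 4.0, i.e. a 4x smaller model file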

Comparison with Related Concepts

It is important to distinguish quantization from other optimization techniques, because each modifies the model in a different way:

  • Quantization vs. Pruning: While quantization reduces the file size by lowering the bit-width of parameters, model pruning involves removing unnecessary connections (weights) entirely to create a sparse network. Pruning alters the model's structure, whereas quantization alters the data representation (see the sketch after this list).
  • Quantization vs. Knowledge Distillation: Knowledge distillation is a training technique in which a small "student" model learns to mimic a larger "teacher" model. Quantization is often applied to the student model after distillation, further boosting edge AI performance.
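
To underline the structural difference, this illustrative NumPy sketch applies both techniques to the same toy weight vector: pruning removes connections outright, while quantization keeps every connection but stores it at lower precision:

import numpy as np

weights = np.array([0.9, -0.02, 0.4, 0.01, -1.3], dtype=np.float32)

# Pruning: zero out low-magnitude weights -> a sparse network, same precision
pruned = weights.copy()
pruned[np.abs(pruned) < 0.05] = 0.0

# Quantization: keep all weights, but represent each with 8 bits
scale = np.abs(weights).max() / 127.0
quantized = np.round(weights / scale).astype(np.int8)

print(pruned)     # fewer active connections, still FP32
print(quantized)  # all connections, smaller data type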

Real-World Applications

Quantization enables computer vision and AI across various industries where efficiency is paramount.

  1. Autonomous Systems: In the automotive industry, self-driving cars must process visual data from cameras and LiDAR in real-time. Quantized models deployed on NVIDIA TensorRT engines allow these vehicles to detect pedestrians and obstacles with millisecond latency, ensuring passenger safety.
  2. Smart Agriculture: Drones equipped with multispectral cameras use quantized object detection models to identify crop diseases and monitor growth stages. Running these models locally on the drone's embedded system removes the dependence on unreliable cellular connections in remote areas.

Implementing Quantization with Ultralytics

The Ultralytics library simplifies the export process, allowing developers to convert models like the cutting-edge YOLO26 into quantized formats. The Ultralytics Platform also provides tools to manage these deployments seamlessly.

The following example demonstrates how to export a model to TFLite with INT8 quantization enabled. This process involves a calibration step where the model observes sample data to determine the optimal dynamic range for the quantized values.

from ultralytics import YOLO

# Load a standard YOLO26 model
model = YOLO("yolo26n.pt")

# Export to TFLite format with INT8 quantization
# The 'int8' argument triggers Post-Training Quantization
# 'data' provides the calibration dataset needed for mapping values
model.export(format="tflite", int8=True, data="coco8.yaml")
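
Continuing from the block above, export() returns the path of the generated file, so the quantized model can be loaded straight back for inference (a usage sketch; the sample image URL is a standard Ultralytics asset):

# Capture the path returned by export() and run the INT8 model directly
tflite_path = model.export(format="tflite", int8=True, data="coco8.yaml")
quantized_model = YOLO(tflite_path)
results = quantized_model("https://ultralytics.com/images/bus.jpg")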

The optimized model can also be deployed via ONNX or high-performance inference engines such as OpenVINO, ensuring broad compatibility across diverse hardware ecosystems.
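
The same export API covers those targets as well; a brief sketch (half enables FP16 export, and supported arguments vary by format):

# FP16 ONNX export for GPU-oriented runtimes
model.export(format="onnx", half=True)

# INT8 OpenVINO export for Intel hardware, with the same calibration data
model.export(format="openvino", int8=True, data="coco8.yaml")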
