Model pruning is a technique in machine learning used to reduce the size and computational complexity of a neural network by systematically removing unnecessary parameters. Much like a gardener trims dead or overgrown branches to encourage a tree to thrive, developers prune artificial networks to make them faster, smaller, and more energy-efficient. This process is essential for deploying modern deep learning architectures on devices with limited resources, such as smartphones, embedded sensors, and edge computing hardware.
How Model Pruning Works
The core idea behind pruning is that deep neural networks are often "over-parameterized," meaning they contain significantly more weights and biases than are strictly necessary to solve a specific problem. During training, the model learns a vast number of connections, but not all contribute equally to the final output. Pruning algorithms analyze the trained model to identify redundant or non-informative connections, typically those with weights close to zero, and remove them.
The lifecycle of a pruned model generally follows these steps, illustrated by the code sketch after the list:
- Training: A large model is trained to convergence to capture complex features.
- Pruning: Low-importance parameters are set to zero or physically removed from the network structure.
- Fine-Tuning: The model undergoes a secondary round of fine-tuning to allow the remaining parameters to adjust and recover any accuracy lost during the pruning phase.
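To make the train-prune-fine-tune loop concrete, here is a minimal PyTorch sketch. The model, random data, and single-step "training" are stand-ins purely for illustration; a real workflow would train and fine-tune over many epochs.

import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

model = nn.Linear(20, 2)  # stand-in for a model trained to convergence
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()
x, y = torch.randn(8, 20), torch.randint(0, 2, (8,))

# 1. Training (a single gradient step here for brevity)
loss_fn(model(x), y).backward()
optimizer.step()
optimizer.zero_grad()

# 2. Pruning: zero out the 40% of weights with the smallest magnitude
prune.l1_unstructured(model, name="weight", amount=0.4)

# 3. Fine-tuning: the mask is re-applied on every forward pass, so
#    further plain-SGD updates only adjust the surviving weights
loss_fn(model(x), y).backward()
optimizer.step()
optimizer.zero_grad()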
This methodology is often associated with the Lottery Ticket Hypothesis, which suggests that dense networks contain smaller, isolated subnetworks (winning tickets) that can achieve comparable accuracy to the original model if trained in isolation.
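A toy sketch of that experiment, assuming a single linear layer and with the actual training loop omitted: save the initial weights, train the dense layer, prune by magnitude, then rewind the surviving weights to their initial values before retraining the sparse subnetwork.

import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

layer = nn.Linear(100, 10)
init_weights = layer.weight.detach().clone()  # remember the "ticket" initialization

# ... train the dense layer to convergence (omitted) ...

# Prune 80% of the weights by magnitude, keeping a mask of survivors
prune.l1_unstructured(layer, name="weight", amount=0.8)

# Rewind: reset the surviving weights to their original initialization;
# the mask stays attached, so retraining starts from the sparse "ticket"
with torch.no_grad():
    layer.weight_orig.copy_(init_weights)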
Types of Pruning Strategies
Pruning methods are generally categorized based on the structure of the components being removed.
- Unstructured Pruning: This approach removes individual weights anywhere in the model based on a threshold (e.g., magnitude). While this effectively reduces the parameter count, it results in sparse matrices that can be difficult for standard hardware to process efficiently. Without specialized software or hardware accelerators, unstructured pruning may not yield significant speed improvements.
- Structured Pruning: This method removes entire geometric structures, such as channels, filters, or layers within a convolutional neural network (CNN). By preserving the dense matrix structure, the pruned model remains compatible with standard GPU and CPU hardware, leading to direct improvements in inference latency and throughput, as sketched below.
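PyTorch's built-in utilities can illustrate the structured variant as well. The minimal sketch below zeroes out a quarter of a convolutional layer's output channels ranked by their L2 norm; note that prune.ln_structured only masks the channels, while physically shrinking the tensors (to realize the latency gains) typically requires a dedicated pruning library or a manual rebuild of the layer.

import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3)

# Structured pruning: zero 25% of entire output channels (dim=0),
# ranked by the L2 norm (n=2) of each channel's weights
prune.ln_structured(conv, name="weight", amount=0.25, n=2, dim=0)

# Count the channels whose weights are now entirely zero
zeroed = (conv.weight.abs().sum(dim=(1, 2, 3)) == 0).sum().item()
print(f"Zeroed output channels: {zeroed} / {conv.out_channels}")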
Real-World Applications
Pruning is a critical enabler for Edge AI, allowing sophisticated models to run in environments where cloud connectivity is unavailable or too slow.
- Mobile Object Detection: Applications on mobile devices, such as real-time language translation or augmented reality, use pruned models to preserve battery life and reduce memory usage. Optimized architectures like YOLO26 are often preferred foundations for these tasks due to their inherent efficiency.
- Automotive Safety: Self-driving cars and autonomous vehicles require split-second decision-making. Pruned models allow onboard computers to process high-resolution camera feeds for pedestrian detection without the latency induced by transmitting data to a server.
- Industrial IoT: In manufacturing, visual inspection systems on assembly lines use lightweight models to detect defects. Pruning ensures these systems can run on cost-effective microcontrollers rather than expensive server racks.
Pruning and Related Optimization Techniques
While model pruning is a powerful tool, it is often confused with or used alongside other model optimization techniques.
- Pruning vs. Quantization: Pruning reduces the number of parameters (connections) in the model. In contrast, model quantization reduces the precision of those parameters, for example, by converting 32-bit floating-point numbers into 8-bit integers. Both are often combined to maximize efficiency for model deployment; a combined sketch follows this list.
- Pruning vs. Knowledge Distillation: Pruning modifies the original model by cutting parts out. Knowledge distillation involves training a completely new, smaller "student" model to mimic the behavior of a larger "teacher" model.
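Because the two techniques are complementary, they are frequently chained. The following minimal sketch, assuming a toy nn.Sequential model, prunes half the weights in each linear layer, makes the pruning permanent with prune.remove, and then applies PyTorch's dynamic quantization to store the remaining parameters as 8-bit integers.

import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10))

# Step 1: prune 50% of the weights in each linear layer, then bake
# the zeros into the plain weight tensors
for layer in model:
    if isinstance(layer, nn.Linear):
        prune.l1_unstructured(layer, name="weight", amount=0.5)
        prune.remove(layer, "weight")

# Step 2: dynamically quantize the linear layers to 8-bit integers
quantized = torch.ao.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)
print(quantized)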
Implementation Example
The following Python example demonstrates how to apply unstructured pruning to a convolutional layer using PyTorch. This is a common step before exporting models to optimized formats like ONNX.
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune
# Initialize a standard convolutional layer
module = nn.Conv2d(in_channels=1, out_channels=20, kernel_size=3)
# Apply unstructured pruning to remove 30% of the connections
# This sets the weights with the lowest L1-norm to zero
prune.l1_unstructured(module, name="weight", amount=0.3)
# Calculate and print the sparsity (percentage of zero elements)
sparsity = 100.0 * float(torch.sum(module.weight == 0)) / module.weight.nelement()
print(f"Layer Sparsity: {sparsity:.2f}%")
For users looking to manage the entire lifecycle of their datasets and models, including training, evaluation, and deployment, the Ultralytics Platform offers a streamlined interface. It simplifies the process of creating highly optimized models like YOLO26 and exporting them to hardware-friendly formats such as TensorRT or CoreML.