Learn how the Adam optimizer powers efficient neural network training with adaptive learning rates, momentum, and real-world applications in AI.
The Adam optimizer, short for Adaptive Moment Estimation, is a sophisticated optimization algorithm widely used to train deep learning models. It revolutionized the field by combining the advantages of two other popular extensions of stochastic gradient descent (SGD): Adaptive Gradient Algorithm (AdaGrad) and Root Mean Square Propagation (RMSProp). By computing individual adaptive learning rates for different parameters from estimates of the first and second moments of the gradients, Adam allows neural networks to converge significantly faster than traditional methods. Its robustness and minimal tuning requirements make it the default choice for many practitioners starting a new machine learning (ML) project.
At its core, training a model involves minimizing a loss function, which measures the difference between the model's predictions and the actual data. Standard algorithms typically use a constant step size (learning rate) to descend the "loss landscape" toward the minimum error. However, this landscape is often complex, featuring ravines and plateaus that can trap simpler algorithms.
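For reference, a plain gradient-descent step applies the same fixed step size to every parameter, no matter how steep or flat the landscape is in that direction. The following is a minimal NumPy sketch with illustrative values, not production code:

```python
import numpy as np

# Minimal sketch of a plain gradient-descent update (illustrative values only).
# Every parameter moves by the same fixed learning rate.
learning_rate = 0.01
weights = np.array([0.5, -1.2, 3.0])  # current parameters
gradients = np.array([0.1, -0.4, 0.02])  # dL/dw from backpropagation

weights = weights - learning_rate * gradients
print(weights)
```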
Adam addresses this by maintaining two historical buffers for every parameter: the first moment, an exponentially decaying average of past gradients that acts like momentum, and the second moment, an exponentially decaying average of past squared gradients that scales the learning rate individually for each parameter.
This combination allows the optimizer to take larger steps in flat areas of the landscape and smaller, more cautious steps in steep or noisy areas. The specific mechanics are detailed in the foundational Adam research paper by Kingma and Ba, which demonstrated its empirical superiority across various deep learning (DL) tasks.
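As a rough illustration of those mechanics, here is a minimal NumPy sketch of a single Adam update step using the default hyperparameters proposed in the paper (beta1=0.9, beta2=0.999, epsilon=1e-8). It is a simplified teaching example, not a substitute for framework implementations:

```python
import numpy as np

# Default hyperparameters from the Adam paper (Kingma & Ba).
lr, beta1, beta2, eps = 0.001, 0.9, 0.999, 1e-8

# State kept per parameter: first moment m (average of gradients)
# and second moment v (average of squared gradients).
weights = np.array([0.5, -1.2, 3.0])
m = np.zeros_like(weights)
v = np.zeros_like(weights)

grad = np.array([0.1, -0.4, 0.02])  # gradient from backpropagation
t = 1  # time step (1-indexed)

m = beta1 * m + (1 - beta1) * grad        # update biased first moment
v = beta2 * v + (1 - beta2) * grad**2     # update biased second moment
m_hat = m / (1 - beta1**t)                # bias-corrected first moment
v_hat = v / (1 - beta2**t)                # bias-corrected second moment

weights -= lr * m_hat / (np.sqrt(v_hat) + eps)  # per-parameter adaptive step
print(weights)
```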
The versatility of the Adam optimizer has led to its adoption across virtually all sectors of artificial intelligence (AI).
While Adam is generally faster to converge, it is important to distinguish it from Stochastic Gradient Descent (SGD). SGD updates model weights using a fixed learning rate and is often preferred for the final stages of training state-of-the-art object detection models because it can sometimes achieve slightly better generalization (final accuracy) on test data.
However, Adam is "adaptive," meaning it handles the tuning of the learning rate automatically. This makes it much more user-friendly for initial experiments and complex architectures where tuning SGD would be difficult. For users managing experiments on the Ultralytics Platform, switching between these optimizers to compare performance is often a key step in hyperparameter tuning.
Modern frameworks like PyTorch and the Ultralytics library make utilizing Adam straightforward. A popular variant called AdamW (Adam with Weight Decay) is often recommended as it fixes issues with regularization in the original Adam algorithm. This is particularly effective for the latest architectures like YOLO26, which benefits from the stability AdamW provides.
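In plain PyTorch, switching from Adam to AdamW is a one-line change. The sketch below uses a small placeholder model purely to show where the decoupled weight_decay argument goes; the model and values are illustrative assumptions:

```python
import torch
import torch.nn as nn

# A tiny placeholder model, purely for illustration.
model = nn.Linear(10, 2)

# AdamW decouples weight decay from the gradient update, which addresses
# the regularization issue in the original Adam formulation.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.01)

# One illustrative training step on random data.
x, y = torch.randn(8, 10), torch.randint(0, 2, (8,))
loss = nn.functional.cross_entropy(model(x), y)
loss.backward()
optimizer.step()
optimizer.zero_grad()
```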
The following example demonstrates how to train a YOLO26 model using the AdamW optimizer:
from ultralytics import YOLO
# Load the cutting-edge YOLO26n model
model = YOLO("yolo26n.pt")
# Train the model using the 'AdamW' optimizer
# The 'optimizer' argument allows easy switching between SGD, Adam, AdamW, etc.
results = model.train(data="coco8.yaml", epochs=5, optimizer="AdamW")
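To compare optimizers during hyperparameter tuning, as mentioned earlier, the same training call can simply be repeated with different values of the optimizer argument. A brief sketch, with short runs on the small coco8 dataset purely for illustration:

```python
from ultralytics import YOLO

# Run short training jobs with different optimizers to compare their behavior.
# Epoch count and dataset are kept small purely for illustration.
for opt in ("SGD", "Adam", "AdamW"):
    model = YOLO("yolo26n.pt")
    results = model.train(data="coco8.yaml", epochs=5, optimizer=opt)
    print(opt, results)
```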
For developers interested in the deeper theoretical underpinnings, resources like the Stanford CS231n Optimization Notes provide excellent visualizations of how Adam compares to other algorithms like RMSProp and AdaGrad. Additionally, the PyTorch Optimizer Documentation offers technical details on the arguments and implementation specifics available for customization.