Explore the vanishing gradient problem and discover how it impacts deep learning. Learn about essential solutions like ReLU, skip connections, and [YOLO26](https://docs.ultralytics.com/models/yolo26/) to optimize training.
The Vanishing Gradient problem is a significant challenge encountered during the training of deep artificial neural networks. It occurs when the gradients—the values that dictate how much the network's parameters should change—become incredibly small as they propagate backward from the output layer to the input layers. Because these gradients are essential for updating the model weights, their disappearance means the earlier layers of the network stop learning. This phenomenon effectively prevents the model from capturing complex patterns in the data, limiting the depth and performance of deep learning architectures.
To understand why this happens, it is helpful to look at the process of backpropagation. During training, the network calculates the error between its prediction and the actual target using a loss function. This error is then sent backward through the layers to adjust the weights. This adjustment relies on the chain rule of calculus, which involves multiplying the derivatives of activation functions layer by layer.
If a network uses activation functions like the sigmoid function or the hyperbolic tangent (tanh), their derivatives are often less than 1. When many of these small numbers are multiplied together in a deep network with dozens or hundreds of layers, the product approaches zero. You can picture this as a game of "telephone": a message whispered down a long line of people grows fainter at every step, and by the time it reaches the far end of the line, the listeners (the network's earliest layers) receive almost nothing useful to act on.
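The effect is easy to demonstrate numerically. The short sketch below (illustrative only, not part of any library) multiplies the derivative of the sigmoid, whose maximum value is 0.25, once per layer and shows how quickly the product collapses toward zero:

```python
import numpy as np


def sigmoid(x):
    """Standard logistic sigmoid."""
    return 1.0 / (1.0 + np.exp(-x))


def sigmoid_derivative(x):
    """Derivative of the sigmoid; its maximum value is 0.25 at x = 0."""
    s = sigmoid(x)
    return s * (1.0 - s)


# Simulate the chain rule across a deep stack of layers by multiplying
# one activation derivative per layer, using the best case (x = 0).
gradient = 1.0
for _ in range(30):
    gradient *= sigmoid_derivative(0.0)

print(f"Gradient signal after 30 layers: {gradient:.3e}")  # about 8.7e-19
```

Even in this best case, thirty sigmoid layers shrink the gradient by roughly eighteen orders of magnitude, which is why the earliest layers effectively stop learning.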
The AI field has developed several effective strategies to mitigate vanishing gradients, such as non-saturating activation functions like ReLU and skip (residual) connections, enabling the creation of powerful models such as Ultralytics YOLO.
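To make the skip-connection idea concrete, here is a minimal PyTorch-style sketch of a residual block (the class name and layer sizes are illustrative, not taken from any specific model): the shortcut adds the unmodified input to the block's output, giving gradients an identity path that does not shrink during backpropagation.

```python
import torch.nn as nn


class ResidualBlock(nn.Module):
    """Illustrative residual block: output = ReLU(F(x) + x)."""

    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        # The skip connection adds x back in, so the gradient flowing to
        # earlier layers always includes an unscaled identity term.
        return self.relu(out + x)
```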
Although vanishing and exploding gradients stem from the same underlying mechanism of repeated multiplication, they are distinct problems. Exploding gradients occur when the multiplied values grow uncontrollably large, causing weight updates to overflow into NaN (Not a Number); this is often fixed by gradient clipping.
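As a rough sketch of how clipping is applied in practice (the toy model, data, and hyperparameters below are illustrative), PyTorch's `torch.nn.utils.clip_grad_norm_` rescales gradients after the backward pass so their global norm stays below a chosen threshold:

```python
import torch
from torch import nn

# Toy model, data, and optimizer used purely for demonstration.
model = nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.MSELoss()
x, y = torch.randn(32, 10), torch.randn(32, 1)

optimizer.zero_grad()
loss = loss_fn(model(x), y)
loss.backward()

# Rescale all gradients so their combined norm never exceeds 1.0,
# preventing a single oversized update from destabilizing the weights.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
```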
Overcoming the vanishing gradient problem has been a prerequisite for the success of modern AI applications.
Modern frameworks and models abstract away most of this complexity. When training a model such as YOLO26, the architecture automatically includes components like the SiLU activation function and batch normalization that prevent gradients from vanishing.
```python
from ultralytics import YOLO

# Load the YOLO26 model (latest generation, Jan 2026)
# This architecture includes residual connections and modern activations
# that inherently prevent vanishing gradients.
model = YOLO("yolo26n.pt")

# Train the model on a dataset
# The optimization process remains stable due to the robust architecture
results = model.train(data="coco8.yaml", epochs=10)
```