Gradient Descent

Discover how Gradient Descent optimizes AI models like Ultralytics YOLO, enabling accurate predictions in tasks from healthcare to self-driving cars.

Gradient Descent is a fundamental optimization algorithm at the heart of most machine learning (ML) and deep learning models. Its primary goal is to minimize a model's error by iteratively adjusting its internal parameters. Imagine standing on a foggy mountain and trying to find the lowest point. You would look around your feet to see which direction slopes downward most steeply and take a step in that direction. By repeating this process, you would eventually reach a valley. In machine learning, the "mountain" is the loss function, the "direction" is the negative gradient of the loss function, and the "step size" is the learning rate.
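The analogy maps directly onto a single parameter update, `w = w - learning_rate * gradient(w)`. A minimal sketch, using a made-up one-dimensional loss `f(w) = w**2` (whose gradient is `2*w`) purely for illustration:

```python
# One gradient-descent step on the toy loss f(w) = w**2.
# The gradient of w**2 is 2*w, so stepping against it moves w toward 0.

def loss(w):
    return w ** 2

def gradient(w):
    return 2 * w  # analytic derivative of w**2

w = 5.0              # starting point on the "mountain"
learning_rate = 0.1  # the "step size"

w = w - learning_rate * gradient(w)  # step in the negative gradient direction

print(w)  # 5.0 - 0.1 * 10.0 = 4.0, i.e. slightly "downhill" toward the minimum
```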

How Gradient Descent Works

The training process for a neural network involves finding the optimal set of model weights that result in the lowest possible error, or loss. Gradient Descent automates this search. The process begins by calculating the gradient—a measure of how much the loss changes with respect to each weight. This calculation is typically performed using the backpropagation algorithm. The weights are then updated by taking a small step in the opposite direction of the gradient, effectively moving "downhill" on the loss surface. This iterative process continues for many epochs until the model's performance converges and the loss is minimized. The size of each step is a critical factor determined by the learning rate, a key setting in hyperparameter tuning. A detailed overview of this process can be found in resources like the Stanford CS231n course notes.
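The loop described above can be sketched in a few lines. This is a hedged toy example, not a real network: the quadratic loss `(w - 3)**2` stands in for the loss surface, and its hand-written gradient stands in for what backpropagation would compute automatically:

```python
# Iterative gradient descent minimizing the toy loss (w - 3)**2,
# whose minimum sits at w = 3. In a real network, backpropagation
# would compute this gradient for millions of weights at once.

def grad(w):
    return 2 * (w - 3)  # analytic gradient of (w - 3)**2

w = 0.0   # initial weight
lr = 0.1  # learning rate: the size of each downhill step

for epoch in range(100):
    w -= lr * grad(w)  # update: move opposite the gradient

print(round(w, 4))  # converges toward 3.0
```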

Types of Gradient Descent

There are three main variations of Gradient Descent, each differing in how much data is used to compute the gradient for each weight update:

  • Batch Gradient Descent (BGD): Calculates the gradient using the entire training dataset. This approach provides a stable and accurate gradient, but it is computationally very expensive and memory-intensive, making it impractical for large datasets like ImageNet.
  • Stochastic Gradient Descent (SGD): Updates the weights after processing just a single, randomly chosen data sample. It is much faster and less memory-intensive than BGD, but the updates are noisy, leading to a more erratic convergence path. This randomness can sometimes help the model escape poor local minima.
  • Mini-Batch Gradient Descent: Strikes a balance by computing the gradient on a small, random subset (a "mini-batch") of the data, typically between 32 and 256 samples. This is the most common approach used in modern deep learning because it combines the efficiency of SGD with the stability of BGD. Frameworks like PyTorch and TensorFlow use this method by default in their optimizers. For an in-depth comparison, see this overview of gradient descent algorithms.
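The mini-batch variant can be illustrated with a small sketch. The data, batch size, and learning rate below are arbitrary choices for the example — a noiseless linear fit `y = w*x` with true slope 2, where each update uses only a random subset of the samples:

```python
# Mini-batch gradient descent fitting a 1-D linear model y = w*x.
# Each step estimates the gradient of the mean squared error from a
# random mini-batch instead of the full dataset.
import random

random.seed(0)
xs = [float(i) for i in range(1, 101)]
ys = [2.0 * x for x in xs]  # synthetic ground truth: w = 2

w = 0.0
lr = 1e-4
batch_size = 32

for step in range(200):
    batch = random.sample(range(len(xs)), batch_size)  # draw a mini-batch
    # Gradient of mean((w*x - y)**2) w.r.t. w is mean(2 * (w*x - y) * x),
    # estimated over the batch only.
    g = sum(2 * (w * xs[i] - ys[i]) * xs[i] for i in batch) / batch_size
    w -= lr * g

print(round(w, 2))  # recovers the true slope, about 2.0
```

Because each update touches only 32 of the 100 samples, the gradient estimates are noisy, but on average they point downhill, which is exactly the SGD/BGD trade-off described above.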

Real-World Applications

Gradient Descent is the engine that powers the training of countless AI models.

  1. Training Object Detection Models: When an Ultralytics YOLO model is trained for object detection on a large-scale dataset like COCO, mini-batch gradient descent is used in every iteration. The model predicts bounding boxes, a loss is calculated based on the error, and Gradient Descent adjusts millions of weights throughout the model's backbone and head to improve accuracy. This entire workflow can be managed and scaled using platforms like Ultralytics HUB.
  2. Training Language Models: In Natural Language Processing (NLP), models like BERT are trained for tasks like sentiment analysis. Gradient Descent minimizes a loss function that measures the difference between the model's predicted sentiment and the true label, enabling the model to learn the nuances of human language from vast text corpora. The Stanford NLP Group provides extensive research in this area.

Challenges and Considerations

While powerful, Gradient Descent is not without its challenges. The algorithm can get stuck in local minima—valleys that are not the absolute lowest point on the loss surface. In very deep networks, it can also suffer from the vanishing gradient or exploding gradient problems, where the gradient becomes too small or too large to effectively update the weights. Careful selection of the learning rate, choice of a robust optimizer, and techniques like batch normalization are crucial for successful training, as detailed in our model training tips guide.
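The learning-rate sensitivity mentioned above can be made concrete with a toy example. On the loss `w**2`, the update `w -= lr * 2*w` multiplies `w` by `(1 - 2*lr)` each step, so any learning rate above 1.0 makes that factor exceed 1 in magnitude and the iterates blow up (a miniature version of the exploding-gradient failure mode; the specific values here are illustrative):

```python
# Effect of the learning rate on gradient descent over loss(w) = w**2.
# Update: w <- w - lr * 2*w = (1 - 2*lr) * w, a geometric sequence.

def run(lr, steps=20, w=1.0):
    for _ in range(steps):
        w -= lr * 2 * w  # gradient of w**2 is 2*w
    return w

good = run(0.1)  # factor |1 - 0.2| = 0.8 < 1: shrinks toward the minimum at 0
bad = run(1.1)   # factor |1 - 2.2| = 1.2 > 1: magnitude grows every step

print(abs(good) < 0.1, abs(bad) > 10.0)  # True True
```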
