
Stochastic Gradient Descent (SGD)

Discover how Stochastic Gradient Descent optimizes machine learning models, enabling efficient training for large datasets and deep learning tasks.

Stochastic Gradient Descent (SGD) is a cornerstone optimization algorithm in machine learning (ML) and deep learning (DL). It acts as the driving force behind model training, iteratively adjusting internal model weights and biases to minimize the error calculated by a loss function. Unlike traditional gradient descent, which processes the entire dataset to calculate a single update, SGD adjusts the model parameters using only a single, randomly selected training example at a time. This "stochastic" (random) approach is computationally efficient and highly scalable, making it feasible to train on big data where processing the full dataset at once would be memory-prohibitive.

How Stochastic Gradient Descent Works

The primary goal of training a neural network is to navigate a complex error landscape to find its lowest point, which corresponds to the smallest loss. SGD achieves this through a repetitive cycle. First, it calculates the gradient (the direction of steepest increase in error) for a specific sample using backpropagation. Then, it updates the weights in the opposite direction to reduce the error.
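
In symbols, this per-sample update is commonly written as follows (the notation here is a standard convention rather than something defined on this page):

$$
\theta_{t+1} = \theta_t - \eta \, \nabla_\theta L\left(\theta_t;\, x_i, y_i\right)
$$

where $\theta$ are the model parameters, $\eta$ is the learning rate, and $(x_i, y_i)$ is the single training example chosen at step $t$.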

The magnitude of this step is controlled by the learning rate, a critical value configured during hyperparameter tuning. Because SGD uses single samples, the path to the minimum is noisy and zig-zagging rather than a straight line. This noise is often beneficial, as it helps the model escape local minima—suboptimal solutions where non-stochastic algorithms might get stuck—allowing it to find a better global solution. This process repeats for many epochs, or complete passes through the dataset, until the model converges. Readers can explore the mathematical intuition in the Stanford CS231n optimization notes.
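
The minimal NumPy sketch below illustrates this loop on a toy linear-regression problem. The synthetic data, quadratic loss, and learning rate are illustrative assumptions, not part of any particular library.

import numpy as np

# Toy problem: fit y = w*x + b with per-sample SGD on synthetic data.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=200)
y = 3.0 * X + 0.5 + rng.normal(scale=0.1, size=200)  # ground truth: w=3.0, b=0.5

w, b = 0.0, 0.0
lr = 0.05  # learning rate (step size)

for epoch in range(20):  # one epoch = one full pass through the shuffled data
    for i in rng.permutation(len(X)):  # random sample order supplies the "stochastic" part
        error = (w * X[i] + b) - y[i]  # prediction error for this single sample
        grad_w, grad_b = error * X[i], error  # gradient of 0.5 * error**2 w.r.t. w and b
        w -= lr * grad_w  # step against the gradient
        b -= lr * grad_b

print(f"Learned w={w:.2f}, b={b:.2f}")

Each inner iteration performs exactly one noisy update, which is why the loss curve of plain SGD fluctuates rather than decreasing smoothly.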

SGD vs. Other Optimization Algorithms

Understanding how SGD differs from related concepts is vital for selecting the right strategy for your training data.

  • Batch Gradient Descent: This method computes the gradient using the entire dataset for every step. While it produces a stable error curve, it is extremely slow and computationally expensive for large datasets.
  • Mini-Batch Gradient Descent: In practice, most "SGD" implementations in frameworks like PyTorch actually use mini-batches. This approach updates parameters using a small group of samples (e.g., 32 or 64 images). It strikes a balance, offering the computational efficiency of SGD with the stability of batch processing (see the PyTorch sketch after this list).
  • Adam Optimizer: The Adam algorithm extends SGD by introducing adaptive learning rates for each parameter. While Adam often converges faster, SGD with momentum is sometimes preferred for computer vision tasks to achieve better generalization and avoid overfitting.
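
As a rough sketch of how these choices look in practice, the PyTorch snippet below wires a mini-batch loader to torch.optim.SGD with momentum and notes the Adam alternative. The toy model, data, and hyperparameters are placeholders, not a recommended setup.

import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Placeholder data and model purely for illustration.
dataset = TensorDataset(torch.randn(512, 10), torch.randn(512, 1))
loader = DataLoader(dataset, batch_size=32, shuffle=True)  # mini-batches of 32 samples
model = nn.Linear(10, 1)
loss_fn = nn.MSELoss()

# "SGD" in most frameworks is mini-batch SGD; momentum is a common addition.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
# Adaptive alternative: optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

for inputs, targets in loader:
    optimizer.zero_grad()  # clear gradients accumulated from the previous step
    loss = loss_fn(model(inputs), targets)
    loss.backward()  # backpropagation computes gradients for this mini-batch
    optimizer.step()  # update the weights using those gradients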

Real-World Applications

SGD and its variants are the standard for training modern AI systems across various industries.

  1. Real-Time Object Detection: When training high-performance models like Ultralytics YOLO11 for object detection, the optimizer must process thousands of images from datasets like COCO. SGD allows the model to rapidly learn features such as edges and object shapes. The stochastic nature helps the model generalize well, which is crucial for safety-critical applications like autonomous vehicles detecting pedestrians in diverse weather conditions.
  2. Natural Language Processing (NLP): Training Large Language Models (LLMs) involves datasets containing billions of words, far too much data to load into memory at once. SGD enables the model to learn grammar, context, and tasks such as sentiment analysis incrementally. This efficiency supports the development of sophisticated virtual assistants and translation tools.

Implementing SGD with Ultralytics

The ultralytics library allows users to easily switch between optimizers. While AdamW might be the default for some tasks, SGD is often used for fine-tuning or specific research requirements. The snippet below demonstrates how to explicitly select SGD for training a model.

from ultralytics import YOLO

# Load the latest YOLO11 model (nano version)
model = YOLO("yolo11n.pt")

# Train the model on the COCO8 dataset using the SGD optimizer
# The 'lr0' argument sets the initial learning rate
results = model.train(data="coco8.yaml", epochs=50, optimizer="SGD", lr0=0.01)

This code initializes a YOLO11 model and begins training with optimizer="SGD". For further customization, refer to the model training configuration documentation. Frameworks like TensorFlow and Scikit-learn also provide robust implementations of SGD for various machine learning tasks.
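
For comparison, here is a brief scikit-learn sketch. SGDClassifier is a real scikit-learn estimator, but the synthetic dataset and parameter choices below are illustrative assumptions (the "log_loss" name assumes a recent scikit-learn release).

from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

# Synthetic binary classification data for illustration only.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# A linear classifier trained with SGD; log loss makes it a logistic regression model.
clf = SGDClassifier(loss="log_loss", learning_rate="constant", eta0=0.01, max_iter=50, random_state=0)
clf.fit(X, y)

print(f"Training accuracy: {clf.score(X, y):.2f}")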
