
Long Short-Term Memory (LSTM)

Discover how Long Short-Term Memory (LSTM) networks excel in handling sequential data, overcoming RNN limitations, and powering AI tasks like NLP and forecasting.

Long Short-Term Memory (LSTM) is a specialized type of Recurrent Neural Network (RNN) architecture designed to effectively learn and remember patterns over long sequences of data. Unlike standard RNNs that struggle with long-term dependencies because of the vanishing gradient problem, LSTMs use a unique gating mechanism to regulate the flow of information. This allows the network to selectively retain important information for extended periods while discarding irrelevant data. The foundational LSTM paper by Hochreiter and Schmidhuber laid the groundwork for this powerful technology, making it a cornerstone of modern deep learning, especially in Natural Language Processing (NLP) and time-series analysis.

How LSTMs Work

The key to an LSTM's capability is its internal structure, which includes a "cell state" and several "gates." The cell state acts as a memory, carrying relevant information through the sequence. The gates—input, forget, and output—are neural networks that control what information is added to, removed from, or read from the cell state. A detailed visualization can be found in this popular Understanding LSTMs blog post.

  • Forget Gate: Decides which information from the previous cell state should be discarded.
  • Input Gate: Determines which new information from the current input should be stored in the cell state.
  • Output Gate: Controls what information from the cell state is used to generate the output for the current time step.

This gating structure enables LSTMs to maintain context over many time steps, a critical feature for understanding sequential data.
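The three gates above can be written out directly. The following is a minimal NumPy sketch of a single LSTM time step, for illustration only; real frameworks fuse these operations and the weight shapes and initialization here are arbitrary choices, not anyone's production internals.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def lstm_cell_step(x, h_prev, c_prev, W, U, b):
    """One LSTM time step. W, U, b stack the four gate parameter
    sets in the order (forget, input, candidate, output)."""
    n = h_prev.shape[0]
    z = W @ x + U @ h_prev + b            # pre-activations for all four gates
    f = sigmoid(z[0 * n:1 * n])           # forget gate: what to discard from c_prev
    i = sigmoid(z[1 * n:2 * n])           # input gate: how much new info to store
    g = np.tanh(z[2 * n:3 * n])           # candidate values for the cell state
    o = sigmoid(z[3 * n:4 * n])           # output gate: what to expose as h
    c = f * c_prev + i * g                # updated cell state (the "memory")
    h = o * np.tanh(c)                    # new hidden state / output
    return h, c


# Toy dimensions and random parameters, purely for demonstration
rng = np.random.default_rng(0)
input_size, hidden_size = 4, 3
x = rng.standard_normal(input_size)
W = 0.1 * rng.standard_normal((4 * hidden_size, input_size))
U = 0.1 * rng.standard_normal((4 * hidden_size, hidden_size))
b = np.zeros(4 * hidden_size)

h, c = lstm_cell_step(x, np.zeros(hidden_size), np.zeros(hidden_size), W, U, b)
print(h.shape, c.shape)  # (3,) (3,)
```

Note how the cell state `c` is updated additively (`f * c_prev + i * g`), which is what lets gradients flow across many time steps without vanishing.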

Real-World Applications

LSTMs have been successfully applied across numerous domains that involve sequential data.

  1. Machine Translation: LSTMs can process a sentence in one language word-by-word, build an internal representation, and then generate a translation in another language. This requires remembering the context from the beginning of the sentence to produce a coherent translation. Services like Google Translate historically used LSTM-based models for this purpose before transitioning to Transformer architectures.
  2. Speech Recognition: In speech-to-text applications, LSTMs can process sequences of audio features to transcribe spoken words. The model needs to consider previous sounds to correctly interpret the current one, demonstrating its ability to handle temporal dependencies. Many modern virtual assistants have relied on this technology for years.

Comparison With Other Sequence Models

LSTMs are part of a broader family of models for sequential data, each with distinct characteristics.

  • Gated Recurrent Unit (GRU): A GRU is a simplified version of an LSTM. It combines the forget and input gates into a single "update gate" and merges the cell state and hidden state. This makes GRUs computationally more efficient and faster to train, though they may be slightly less expressive than LSTMs on some tasks.
  • Recurrent Neural Network (RNN): A standard RNN is the predecessor to LSTMs. While it can process sequences, it has a simpler structure that makes it prone to the vanishing gradient problem, limiting its ability to learn long-range dependencies that LSTMs excel at capturing.
  • Transformer: The Transformer architecture, which relies on a self-attention mechanism, has largely surpassed LSTMs as the state-of-the-art for many NLP tasks. Unlike LSTMs' sequential processing, Transformers can process all elements of a sequence in parallel, making them highly efficient on modern hardware like GPUs and better at capturing global dependencies.
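The gate-count difference between these recurrent variants shows up directly in their parameter counts. A quick PyTorch check (the layer sizes here are arbitrary examples):

```python
import torch.nn as nn


def param_count(module):
    return sum(p.numel() for p in module.parameters())


# Same dimensions for all three, so only the cell type differs
kwargs = dict(input_size=10, hidden_size=20, num_layers=1)
rnn = nn.RNN(**kwargs)
gru = nn.GRU(**kwargs)
lstm = nn.LSTM(**kwargs)

# A vanilla RNN has one weight set per cell, a GRU three (reset,
# update, candidate), and an LSTM four (forget, input, candidate,
# output), so the counts scale roughly 1 : 3 : 4.
print("RNN :", param_count(rnn))
print("GRU :", param_count(gru))
print("LSTM:", param_count(lstm))
```

This is one reason GRUs train faster than LSTMs at the same hidden size: fewer parameters per cell means less compute per time step.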

Implementation Example

LSTMs can be readily implemented using popular deep learning frameworks. The example below shows how to create a simple LSTM layer using PyTorch, the framework that powers Ultralytics models.

import torch
import torch.nn as nn

# Define LSTM layer: input_size=10, hidden_size=20, num_layers=2
lstm_layer = nn.LSTM(input_size=10, hidden_size=20, num_layers=2)

# Create a dummy input sequence: sequence_length=5, batch_size=3, input_size=10
input_sequence = torch.randn(5, 3, 10)

# Forward pass through the LSTM layer
# Forward pass: output holds the hidden state at every time step,
# while (hidden_state, cell_state) hold only the final step for each layer
output, (hidden_state, cell_state) = lstm_layer(input_sequence)

print("Output shape:", output.shape)
# Expected output shape: (sequence_length, batch_size, hidden_size) -> (5, 3, 20)
print("Hidden state shape:", hidden_state.shape)
# Expected hidden state shape: (num_layers, batch_size, hidden_size) -> (2, 3, 20)

While Ultralytics primarily focuses on Computer Vision (CV) models like Ultralytics YOLO11 for tasks such as object detection and pose estimation, understanding sequence models like LSTMs is valuable. Research continues to explore ways of bridging NLP and CV for tasks like video understanding and image captioning. You can explore a wide range of machine learning models and concepts in the Ultralytics documentation. For a deeper dive into sequence models, courses from institutions like DeepLearning.AI offer comprehensive learning paths. Frameworks such as TensorFlow also provide robust LSTM implementations, which you can explore in the official TensorFlow Keras LSTM documentation.
