Discover how Transformer architectures revolutionize AI, powering breakthroughs in NLP, computer vision, and advanced ML tasks.
A Transformer is a groundbreaking neural network architecture that uses a self-attention mechanism to process input data in parallel, revolutionizing the fields of Natural Language Processing (NLP) and Computer Vision (CV). First introduced by Google researchers in the seminal 2017 paper "Attention Is All You Need", the Transformer moves away from the sequential processing used by older architectures. Instead, it analyzes entire sequences of data simultaneously, allowing it to capture long-range dependencies and contextual relationships with unprecedented efficiency. This architecture serves as the foundation for modern Generative AI and powerful Large Language Models (LLMs) like GPT-4.
The defining characteristic of a Transformer is its reliance on the attention mechanism, specifically self-attention. Unlike Recurrent Neural Networks (RNNs), which process data step-by-step (e.g., word by word), Transformers ingest the entire input at once. To understand the order of the data, they employ positional encodings, which are added to the input embeddings to retain information about the sequence structure.
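To make these two ideas concrete, here is a minimal, illustrative NumPy sketch of scaled dot-product self-attention and sinusoidal positional encoding. The function names and random projection matrices are hypothetical; the snippet shows the mechanism only, not how any production library implements it.

```python
import numpy as np


def positional_encoding(seq_len, d_model):
    """Sinusoidal encodings added to embeddings so the model can recover token order."""
    pos = np.arange(seq_len)[:, None]
    i = np.arange(d_model)[None, :]
    angles = pos / np.power(10000.0, (2 * (i // 2)) / d_model)
    return np.where(i % 2 == 0, np.sin(angles), np.cos(angles))


def self_attention(x, W_q, W_k, W_v):
    """Scaled dot-product self-attention: every token attends to every other token at once."""
    Q, K, V = x @ W_q, x @ W_k, x @ W_v
    scores = Q @ K.T / np.sqrt(Q.shape[-1])  # pairwise similarity between tokens
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over the sequence dimension
    return weights @ V  # context-aware mix of value vectors


rng = np.random.default_rng(0)
tokens = rng.normal(size=(4, 8))  # toy sequence: 4 tokens, 8-dim embeddings
tokens = tokens + positional_encoding(4, 8)  # inject order information
W_q, W_k, W_v = (rng.normal(size=(8, 8)) for _ in range(3))
print(self_attention(tokens, W_q, W_k, W_v).shape)  # (4, 8): one updated vector per token
```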
The architecture typically consists of encoder and decoder stacks: the encoder processes the input sequence and builds a context-aware representation of each token, while the decoder uses that representation, together with the tokens it has already produced, to generate the output sequence.
This parallel structure allows for massive scalability, enabling researchers to train models on vast datasets using high-performance GPUs.
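As a rough sketch of this encoder-decoder layout (assuming PyTorch is available), the snippet below instantiates a small model with the stack depths used in the original paper and passes toy source and target sequences through it.

```python
import torch
import torch.nn as nn

# A small Transformer with 6 encoder and 6 decoder layers, as in "Attention Is All You Need"
model = nn.Transformer(d_model=512, nhead=8, num_encoder_layers=6, num_decoder_layers=6, batch_first=True)

src = torch.rand(2, 10, 512)  # batch of 2 source sequences, 10 tokens each
tgt = torch.rand(2, 7, 512)  # batch of 2 target sequences, 7 tokens each

out = model(src, tgt)
print(out.shape)  # torch.Size([2, 7, 512]): one output vector per target position
```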
While originally designed for text, the architecture has been successfully adapted for visual tasks through the Vision Transformer (ViT). In this approach, an image is split into a sequence of fixed-size patches (similar to words in a sentence). The model then uses self-attention to weigh the importance of different patches relative to each other, capturing global context that traditional Convolutional Neural Networks (CNNs) might miss.
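To illustrate the patch step (assuming the 224x224 input resolution and 16x16 patch size commonly used by ViT), the PyTorch snippet below turns an image tensor into a sequence of flattened patch tokens ready for self-attention.

```python
import torch

# Split a 224x224 RGB image into 16x16 patches, treating each patch like a word in a sentence
image = torch.rand(1, 3, 224, 224)  # (batch, channels, height, width)
patches = image.unfold(2, 16, 16).unfold(3, 16, 16)  # (1, 3, 14, 14, 16, 16)
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(1, 14 * 14, 3 * 16 * 16)
print(patches.shape)  # torch.Size([1, 196, 768]): a sequence of 196 patch tokens
```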
For example, the Real-Time Detection Transformer (RT-DETR) utilizes this architecture to perform highly accurate object detection. Unlike CNN-based models that rely on local features, RT-DETR can understand the relationship between distant objects in a scene. However, it is worth noting that while Transformers excel at global context, CNN-based models like Ultralytics YOLO11 often provide a better balance of speed and accuracy for real-time edge applications. Community models like YOLO12 have attempted to integrate heavy attention layers but frequently suffer from training instability and slower inference speeds than the optimized CNN architecture of YOLO11.
The versatility of the Transformer architecture has led to its adoption across various industries.
You can experiment with Transformer-based computer vision models directly using the ultralytics package.
The following example demonstrates how to load the RT-DETR model for object detection.
```python
from ultralytics import RTDETR

# Load a pretrained RT-DETR model (Transformer-based)
model = RTDETR("rtdetr-l.pt")

# Perform inference on an image to detect objects using global attention
results = model("https://ultralytics.com/images/bus.jpg")

# Display the results
results[0].show()
```
It is important to distinguish Transformers from other common deep learning (DL) architectures: unlike RNNs, which process tokens sequentially and can struggle with long-range dependencies, Transformers attend to the entire sequence in parallel; and unlike CNNs, which build features from local receptive fields, self-attention captures global relationships between all positions in the input from the very first layer.
Research is continuously improving the efficiency of Transformers. Innovations like FlashAttention are reducing the computational cost, allowing for longer context windows. Furthermore, multimodal AI systems are merging Transformers with other architectures to process text, images, and audio simultaneously. As these technologies mature, the upcoming Ultralytics Platform will provide a unified environment to train, deploy, and monitor these sophisticated models alongside standard computer vision tasks.