Vision Transformer (ViT)

Discover the power of Vision Transformers (ViTs) in computer vision. Learn how they outperform CNNs by capturing global image context.

A Vision Transformer (ViT) is a deep learning architecture that adapts the self-attention mechanisms originally designed for Natural Language Processing (NLP) to visual tasks. Unlike a traditional Convolutional Neural Network (CNN), which processes images through a hierarchy of local pixel grids, a ViT treats an image as a sequence of discrete patches. This approach was popularized by the landmark research paper "An Image is Worth 16x16 Words", which demonstrated that pure transformer architectures could achieve state-of-the-art results in computer vision (CV) without relying on convolutional layers. By leveraging global attention, ViTs can capture long-range dependencies across an entire image from the very first layer.

How Vision Transformers Work

The fundamental innovation of the ViT is the way it structures input data. To make an image compatible with a standard Transformer, the model breaks the visual information down into a sequence of vectors, mimicking how a language model processes a sentence of words.

  1. Patch Tokenization: The input image is divided into a grid of fixed-size squares, typically 16x16 pixels. Each square is flattened into a vector, effectively becoming a visual token.
  2. Linear Projection: These flattened patches are passed through a trainable linear layer to create dense embeddings. This step maps the raw pixel values into a high-dimensional space that the model can process.
  3. Positional Encoding: Since the architecture processes sequences in parallel and lacks an inherent understanding of order or space, learnable positional encodings are added to the patch embeddings. This allows the model to retain spatial information about where each patch belongs in the original image.
  4. Self-Attention Mechanism: The sequence enters the Transformer encoder, where self-attention allows every patch to interact with every other patch simultaneously. This enables the network to learn global context, understanding how a pixel in the top-left corner relates to one in the bottom-right.
  5. Classification Head: For tasks like image classification, a special "class token" is often prepended to the sequence. The final output state of this token serves as the aggregate representation of the image, which is then fed into a classifier such as a multilayer perceptron (MLP). The sketch after this list walks through these five steps in code.
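
The following minimal PyTorch sketch strings these five steps together end to end. The patch size, embedding width, depth, and head count are illustrative assumptions, not the configuration of any published ViT variant:

import torch
import torch.nn as nn


class MiniViT(nn.Module):
    """Minimal ViT-style classifier illustrating steps 1-5 above."""

    def __init__(self, image_size=224, patch_size=16, dim=192, depth=4, heads=3, num_classes=10):
        super().__init__()
        num_patches = (image_size // patch_size) ** 2
        # 1-2. Patch tokenization + linear projection, done in one strided convolution
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)
        # 3. Learnable class token and positional encodings (+1 slot for the class token)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))
        # 4. Standard Transformer encoder built on self-attention
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        # 5. Classification head applied to the class token output
        self.head = nn.Linear(dim, num_classes)

    def forward(self, x):
        b = x.shape[0]
        patches = self.patch_embed(x).flatten(2).transpose(1, 2)  # (B, N, dim)
        cls = self.cls_token.expand(b, -1, -1)
        tokens = torch.cat([cls, patches], dim=1) + self.pos_embed
        tokens = self.encoder(tokens)
        return self.head(tokens[:, 0])  # class-token state -> class logits


logits = MiniViT()(torch.randn(2, 3, 224, 224))
print(logits.shape)  # torch.Size([2, 10])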

Vision Transformers vs. CNNs

While both architectures aim to understand visual data, they differ significantly in their operational philosophy. CNNs have strong built-in inductive biases, namely locality and translation equivariance: they assume that local features (like edges and textures) are meaningful wherever they appear in the image. These biases make CNNs highly data-efficient and effective on smaller datasets.

Conversely, Vision Transformers have less image-specific bias. They must learn spatial relationships from scratch using massive amounts of training data, such as ImageNet-21k or JFT-300M. While this makes training more computationally intensive, it allows ViTs to scale remarkably well; with sufficient data and compute, they can outperform CNNs by capturing complex global structures that local convolutions might miss.
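
The difference in receptive field can be made concrete with a short, illustrative PyTorch comparison (the patch count and embedding width below are arbitrary assumptions): a single self-attention layer produces an N x N weight matrix linking every patch to every other patch, while a 3x3 convolution only mixes each position with its immediate neighbours, so long-range relationships require stacking many layers.

import torch
import torch.nn as nn

num_patches, dim = 196, 192  # 14x14 grid of 16x16 patches from a 224x224 image (illustrative)
tokens = torch.randn(1, num_patches, dim)

# ViT: one self-attention layer already relates every patch to every other patch
attn = nn.MultiheadAttention(embed_dim=dim, num_heads=4, batch_first=True)
_, weights = attn(tokens, tokens, tokens, need_weights=True)
print(weights.shape)  # torch.Size([1, 196, 196]) -> global pairwise interactions in layer 1

# CNN: a 3x3 convolution only mixes each position with its 3x3 neighbourhood per layer
conv = nn.Conv2d(dim, dim, kernel_size=3, padding=1)
feature_map = tokens.transpose(1, 2).reshape(1, dim, 14, 14)
print(conv(feature_map).shape)  # torch.Size([1, 192, 14, 14]) with a local receptive field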

Real-World Applications

The ability to understand global context makes ViTs particularly useful for complex, high-stakes environments.

  • Medical Image Analysis: In healthcare AI, ViTs are used to analyze high-resolution scans like MRIs or histopathology slides. For instance, in tumor detection, a ViT can correlate subtle textural anomalies in tissue with broader structural changes across the slide, identifying malignant patterns that local processing might overlook.
  • Satellite Imagery and Remote Sensing: ViTs excel at satellite image analysis where the relationships between objects span large distances. For example, connecting a deforestation site to a distant logging road requires understanding the "big picture" of a landscape, a task where the global attention of a ViT outperforms the limited receptive field of standard CNNs.

Utilizing Transformers with Ultralytics

The ultralytics library supports Transformer-based architectures, most notably the RT-DETR (Real-Time Detection Transformer). While the flagship YOLO26 is often preferred for its balance of speed and accuracy on edge devices, RT-DETR offers a powerful alternative for scenarios prioritizing global context.

The following Python example demonstrates how to load a pre-trained Transformer-based model and run inference:

from ultralytics import RTDETR

# Load a pre-trained RT-DETR model (Vision Transformer-based)
model = RTDETR("rtdetr-l.pt")

# Run inference on an image source
# The model uses self-attention to detect objects globally
results = model("https://ultralytics.com/images/bus.jpg")

# Display the detection results
results[0].show()

Future Outlook

Research is rapidly evolving to address the high computational cost of ViTs. Techniques like FlashAttention are making these models faster and more memory-efficient. Furthermore, hybrid architectures that combine the efficiency of CNNs with the attention of Transformers are becoming common. For teams looking to manage these advanced workflows, the Ultralytics Platform offers a unified environment to annotate data, train complex models via the cloud, and deploy them to diverse endpoints.
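
As a rough illustration of the memory-efficiency point, PyTorch's scaled_dot_product_attention can dispatch to a fused FlashAttention-style kernel when the hardware and data type support it, avoiding materializing the full attention matrix; the tensor shapes below are arbitrary assumptions chosen to mimic 196 patch tokens split across 3 heads.

import torch
import torch.nn.functional as F

# Illustrative query/key/value tensors: batch 1, 3 heads, 196 patch tokens, head width 64
q = k = v = torch.randn(1, 3, 196, 64)

# Dispatches to a fused, memory-efficient attention kernel when one is available
out = F.scaled_dot_product_attention(q, k, v)
print(out.shape)  # torch.Size([1, 3, 196, 64])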
