Vision Transformer (ViT)

Discover the power of Vision Transformers (ViTs) in computer vision. Learn how they outperform CNNs by capturing global image context.

A Vision Transformer (ViT) is a deep learning architecture that adapts the self-attention mechanisms originally designed for Natural Language Processing (NLP) to visual tasks. Unlike a traditional Convolutional Neural Network (CNN), which processes images through a hierarchy of local pixel grids, a ViT treats an image as a sequence of discrete patches. This approach was popularized by the landmark research paper "An Image is Worth 16x16 Words", which demonstrated that pure transformer architectures could achieve state-of-the-art results in computer vision (CV) without relying on convolutional layers. By leveraging global attention, ViTs can capture long-range dependencies across an entire image from the very first layer.

How Vision Transformers Work

The fundamental innovation of the ViT is the way it structures input data. To make an image compatible with a standard Transformer, the model breaks the visual information down into a sequence of vectors, mimicking how a language model processes a sentence of words.

  1. Patch Tokenization: The input image is divided into a grid of fixed-size squares, typically 16x16 pixels. Each square is flattened into a vector, effectively becoming a visual token.
  2. Linear Projection: These flattened patches are passed through a trainable linear layer to create dense embeddings. This step maps the raw pixel values into a high-dimensional space that the model can process.
  3. Positional Encoding: Since the architecture processes sequences in parallel and lacks an inherent understanding of order or space, learnable positional encodings are added to the patch embeddings. This allows the model to retain spatial information about where each patch belongs in the original image.
  4. Self-Attention Mechanism: The sequence enters the Transformer encoder, where self-attention allows every patch to interact with every other patch simultaneously. This enables the network to learn global context, understanding how a pixel in the top-left corner relates to one in the bottom-right.
  5. Classification Head: For tasks like image classification, a special "class token" is often prepended to the sequence. The final output state of this token serves as the aggregate representation of the image, which is then fed into a classifier such as a multilayer perceptron (MLP). The sketch after this list walks through these five steps in code.
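
The following minimal PyTorch sketch strings these five steps together end to end. The patch size, embedding width, depth, and head count are illustrative assumptions, not the configuration of any published ViT variant:

import torch
import torch.nn as nn


class MiniViT(nn.Module):
    """Minimal ViT-style classifier illustrating steps 1-5 above."""

    def __init__(self, image_size=224, patch_size=16, dim=192, depth=4, heads=3, num_classes=10):
        super().__init__()
        num_patches = (image_size // patch_size) ** 2
        # 1-2. Patch tokenization + linear projection, done in one strided convolution
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)
        # 3. Learnable class token and positional encodings (+1 slot for the class token)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))
        # 4. Standard Transformer encoder built on self-attention
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        # 5. Classification head applied to the class token output
        self.head = nn.Linear(dim, num_classes)

    def forward(self, x):
        b = x.shape[0]
        patches = self.patch_embed(x).flatten(2).transpose(1, 2)  # (B, N, dim)
        cls = self.cls_token.expand(b, -1, -1)
        tokens = torch.cat([cls, patches], dim=1) + self.pos_embed
        tokens = self.encoder(tokens)
        return self.head(tokens[:, 0])  # class-token state -> class logits


logits = MiniViT()(torch.randn(2, 3, 224, 224))
print(logits.shape)  # torch.Size([2, 10])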

Vision Transformers vs. CNNs

While both architectures aim to understand visual data, they differ significantly in their operational philosophy. CNNs have strong built-in inductive biases, namely locality and translation equivariance: they assume that local features (like edges and textures) are meaningful wherever they appear in the image. These biases make CNNs highly data-efficient and effective on smaller datasets.

Conversely, Vision Transformers have less image-specific bias. They must learn spatial relationships from scratch using massive amounts of training data, such as ImageNet-21k or JFT-300M. While this makes training more computationally intensive, it allows ViTs to scale remarkably well; with sufficient data and compute, they can outperform CNNs by capturing complex global structures that local convolutions might miss.
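
The difference in receptive field can be made concrete with a short, illustrative PyTorch comparison (the patch count and embedding width below are arbitrary assumptions): a single self-attention layer produces an N x N weight matrix linking every patch to every other patch, while a 3x3 convolution only mixes each position with its immediate neighbours, so long-range relationships require stacking many layers.

import torch
import torch.nn as nn

num_patches, dim = 196, 192  # 14x14 grid of 16x16 patches from a 224x224 image (illustrative)
tokens = torch.randn(1, num_patches, dim)

# ViT: one self-attention layer already relates every patch to every other patch
attn = nn.MultiheadAttention(embed_dim=dim, num_heads=4, batch_first=True)
_, weights = attn(tokens, tokens, tokens, need_weights=True)
print(weights.shape)  # torch.Size([1, 196, 196]) -> global pairwise interactions in layer 1

# CNN: a 3x3 convolution only mixes each position with its 3x3 neighbourhood per layer
conv = nn.Conv2d(dim, dim, kernel_size=3, padding=1)
feature_map = tokens.transpose(1, 2).reshape(1, dim, 14, 14)
print(conv(feature_map).shape)  # torch.Size([1, 192, 14, 14]) with a local receptive field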

Real-World Applications

The ability to understand global context makes ViTs particularly useful for complex, high-stakes environments.

  • Medical Image Analysis: In healthcare AI, ViTs are used to analyze high-resolution scans like MRIs or histopathology slides. For instance, in tumor detection, a ViT can correlate subtle textural anomalies in tissue with broader structural changes across the slide, identifying malignant patterns that local processing might overlook.
  • Satellite Imagery and Remote Sensing: ViTs excel at satellite image analysis where the relationships between objects span large distances. For example, connecting a deforestation site to a distant logging road requires understanding the "big picture" of a landscape, a task where the global attention of a ViT outperforms the limited receptive field of standard CNNs.

Utilizing Transformers with Ultralytics

The ultralytics library supports Transformer-based architectures, most notably the RT-DETR (Real-Time Detection Transformer). While the flagship YOLO26 is often preferred for its balance of speed and accuracy on edge devices, RT-DETR offers a powerful alternative for scenarios prioritizing global context.

The following Python example demonstrates how to load a pre-trained Transformer-based model and run inference:

from ultralytics import RTDETR

# Load a pre-trained RT-DETR model (Vision Transformer-based)
model = RTDETR("rtdetr-l.pt")

# Run inference on an image source
# The model uses self-attention to detect objects globally
results = model("https://ultralytics.com/images/bus.jpg")

# Display the detection results
results[0].show()

Future Outlook

Research is rapidly evolving to address the high computational cost of ViTs. Techniques like FlashAttention are making these models faster and more memory-efficient. Furthermore, hybrid architectures that combine the efficiency of CNNs with the attention of Transformers are becoming common. For teams looking to manage these advanced workflows, the Ultralytics Platform offers a unified environment to annotate data, train complex models via the cloud, and deploy them to diverse endpoints.
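
As a rough illustration of the memory-efficiency point, PyTorch's scaled_dot_product_attention can dispatch to a fused FlashAttention-style kernel when the hardware and data type support it, avoiding materializing the full attention matrix; the tensor shapes below are arbitrary assumptions chosen to mimic 196 patch tokens split across 3 heads.

import torch
import torch.nn.functional as F

# Illustrative query/key/value tensors: batch 1, 3 heads, 196 patch tokens, head width 64
q = k = v = torch.randn(1, 3, 196, 64)

# Dispatches to a fused, memory-efficient attention kernel when one is available
out = F.scaled_dot_product_attention(q, k, v)
print(out.shape)  # torch.Size([1, 3, 196, 64])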
