Experience the power of Vision Transformers (ViT) in computer vision. Learn how they capture global image context to outperform CNNs.
A Vision Transformer (ViT) is a deep learning architecture that adapts the self-attention mechanisms originally designed for Natural Language Processing (NLP) to visual tasks. Unlike a traditional Convolutional Neural Network (CNN), which processes images through a hierarchy of local convolutional filters, a ViT treats an image as a sequence of discrete patches. This approach was popularized by the landmark research paper "An Image is Worth 16x16 Words", which demonstrated that pure transformer architectures could achieve state-of-the-art performance in computer vision (CV) without relying on convolutional layers. By leveraging global attention, ViTs can capture long-range dependencies across an entire image from the very first layer.
The fundamental innovation of the ViT is the way it structures input data. To make an image compatible with a standard Transformer, the model breaks the visual information down into a sequence of vectors, mimicking how a language model processes a sentence of words.
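As a rough illustration of this step, the minimal PyTorch sketch below splits an image into flattened patches, projects them into embeddings, and adds positional information. The 224x224 input size, 16x16 patch size, and 768-dimensional embedding are illustrative assumptions, not the paper's exact implementation:
import torch
import torch.nn as nn

# Illustrative assumptions: a 224x224 RGB image and 16x16 patches,
# giving (224 / 16) ** 2 = 196 patch tokens per image.
image = torch.randn(1, 3, 224, 224)  # (batch, channels, height, width)
patch_size = 16
embed_dim = 768

# Split the image into non-overlapping patches and flatten each one.
patches = image.unfold(2, patch_size, patch_size).unfold(3, patch_size, patch_size)
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(1, -1, 3 * patch_size * patch_size)
print(patches.shape)  # torch.Size([1, 196, 768]) -> a "sentence" of 196 patch vectors

# A linear projection maps each flattened patch to the Transformer embedding size,
# and learned positional embeddings preserve spatial order.
projection = nn.Linear(3 * patch_size * patch_size, embed_dim)
pos_embedding = nn.Parameter(torch.zeros(1, patches.shape[1], embed_dim))
tokens = projection(patches) + pos_embedding
print(tokens.shape)  # torch.Size([1, 196, 768]) -> ready for self-attention layers
The resulting sequence of patch tokens plays the same role for the Transformer encoder that a sequence of word embeddings plays for a language model.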
While both architectures aim to understand visual data, they differ significantly in their operational philosophy. CNNs possess strong "inductive biases", namely locality and translation equivariance, meaning they inherently assume that local features (like edges and textures) matter regardless of where they appear in the image. This makes CNNs highly data-efficient and effective on smaller datasets.
Conversely, Vision Transformers have less image-specific bias. They must learn spatial relationships from scratch using massive amounts of training data, such as the JFT-300M or full ImageNet datasets. While this makes training more computationally intensive, it allows ViTs to scale remarkably well; with sufficient data and compute power, they can outperform CNNs by capturing complex global structures that local convolutions might miss.
The ability to understand global context makes ViTs particularly useful in complex, high-stakes domains such as medical imaging and autonomous driving.
The Ultralytics library supports Transformer-based architectures, most notably RT-DETR (Real-Time Detection Transformer). While the flagship YOLO26 is often preferred for its balance of speed and accuracy on edge devices, RT-DETR offers a powerful alternative for scenarios that prioritize global context.
The following Python example demonstrates how to load a pre-trained Transformer-based model and run inference:
from ultralytics import RTDETR
# Load a pre-trained RT-DETR model (Vision Transformer-based)
model = RTDETR("rtdetr-l.pt")
# Run inference on an image source
# The model uses self-attention to detect objects globally
results = model("https://ultralytics.com/images/bus.jpg")
# Display the detection results
results[0].show()
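Each entry in results exposes the predicted bounding boxes, class indices, and confidence scores through its boxes attribute, so RT-DETR predictions can be filtered, plotted, or exported just like those of any other Ultralytics detection model.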
Research is rapidly evolving to address the high computational cost of ViTs. Techniques like FlashAttention are making these models faster and more memory-efficient. Furthermore, hybrid architectures that combine the efficiency of CNNs with the attention of Transformers are becoming common. For teams looking to manage these advanced workflows, the Ultralytics Platform offers a unified environment to annotate data, train complex models via the cloud, and deploy them to diverse endpoints.
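To make the attention-optimization point concrete, here is a minimal sketch, assuming PyTorch 2.x, where torch.nn.functional.scaled_dot_product_attention can dispatch to fused, FlashAttention-style kernels on supported hardware. The token count and head sizes below are illustrative assumptions roughly matching a ViT-Base layer:
import torch
import torch.nn.functional as F

# Illustrative assumption: a batch of 196 patch tokens with 12 attention heads
# and a 64-dimensional head size.
batch, heads, tokens, head_dim = 1, 12, 196, 64
q = torch.randn(batch, heads, tokens, head_dim)
k = torch.randn(batch, heads, tokens, head_dim)
v = torch.randn(batch, heads, tokens, head_dim)

# PyTorch picks a fused, memory-efficient kernel (e.g. FlashAttention) when the
# hardware and dtypes allow it, instead of materializing the full
# tokens x tokens attention matrix in memory.
out = F.scaled_dot_product_attention(q, k, v)
print(out.shape)  # torch.Size([1, 12, 196, 64])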