Discover how Transformer architectures revolutionize artificial intelligence, powering breakthroughs in NLP, computer vision, and advanced machine learning tasks.
A Transformer is a deep learning architecture that relies on a mechanism called self-attention to process sequential input data, such as natural language or visual features. Originally introduced by Google researchers in the landmark paper Attention Is All You Need, the Transformer revolutionized the field of artificial intelligence (AI) by discarding the sequential processing limitations of earlier Recurrent Neural Networks (RNNs). Instead, Transformers analyze entire sequences of data simultaneously, allowing for massive parallelization and significantly faster training times on modern hardware like GPUs.
The core innovation of the Transformer is the self-attention mechanism. This allows the model to weigh the importance of different parts of the input data relative to each other. For instance, in a sentence, the model can learn that the word "bank" relates more closely to "money" than to "river" based on the surrounding context.
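As a rough illustration of the idea (not the full multi-head mechanism described in the paper), the following sketch computes scaled dot-product self-attention over a toy sequence of token embeddings. The tensor sizes are arbitrary, and the learned query/key/value projections are omitted for brevity.

```python
import torch
import torch.nn.functional as F

# Toy "sentence" of 4 tokens, each represented by an 8-dimensional embedding
x = torch.rand(4, 8)

# In a real Transformer, queries, keys, and values come from learned linear
# projections of x; here we reuse x directly to keep the sketch minimal
q, k, v = x, x, x

# Scaled dot-product attention: compare every token with every other token
scores = q @ k.transpose(-2, -1) / (k.shape[-1] ** 0.5)  # (4, 4) similarity matrix
weights = F.softmax(scores, dim=-1)  # each row sums to 1
context = weights @ v  # context-aware token representations

print(weights)  # how strongly each token attends to the others
```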
This architecture generally consists of two main components: an encoder, which maps the input sequence into contextual representations, and a decoder, which generates the output sequence from those representations.
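As a minimal sketch of this encoder-decoder layout, PyTorch's built-in `torch.nn.Transformer` module (also referenced in the further-reading links below) can be driven with dummy sequences. The dimensions used here match the base configuration reported in the original paper, but the inputs are random placeholders rather than real embeddings.

```python
import torch
import torch.nn as nn

# Encoder-decoder Transformer with the base configuration of the original paper:
# model width 512, 8 attention heads, 6 encoder layers, 6 decoder layers
model = nn.Transformer(d_model=512, nhead=8, num_encoder_layers=6, num_decoder_layers=6, batch_first=True)

# Dummy embedded sequences: a batch of 2 source sequences (length 10)
# and 2 target sequences (length 7), each position a 512-dimensional vector
src = torch.rand(2, 10, 512)
tgt = torch.rand(2, 7, 512)

out = model(src, tgt)
print(out.shape)  # torch.Size([2, 7, 512]) -- one output vector per target position
```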
In the realm of computer vision (CV), models usually employ a variation called the Vision Transformer (ViT). Instead of processing text tokens, the image is split into fixed-size patches (e.g., 16x16 pixels). These patches are flattened and treated as a sequence, enabling the model to capture "global context"—understanding relationships between distant parts of an image—more effectively than a standard Convolutional Neural Network (CNN).
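The patching step can be sketched in a few lines of plain PyTorch. The 224x224 input size and 16x16 patch size below are common ViT choices used purely for illustration.

```python
import torch

# Dummy RGB image: batch of 1, 3 channels, 224x224 pixels
image = torch.rand(1, 3, 224, 224)
patch_size = 16

# Split the image into non-overlapping 16x16 patches along height and width
patches = image.unfold(2, patch_size, patch_size).unfold(3, patch_size, patch_size)

# Collect the 14x14 grid of patches into a sequence and flatten each patch
patches = patches.contiguous().view(1, 3, -1, patch_size, patch_size)  # (1, 3, 196, 16, 16)
patches = patches.permute(0, 2, 1, 3, 4).flatten(2)  # (1, 196, 768)

print(patches.shape)  # 196 patch "tokens", each a 768-dimensional vector
```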
It is important to distinguish the Transformer itself, which is an architecture, from the specific models built on top of it, such as BERT and GPT.
The versatility of Transformers has led to their adoption across various industries, spanning natural language processing tasks such as machine translation as well as computer vision applications like object detection.
While CNNs have traditionally dominated object detection, Transformer-based models like the Real-Time Detection Transformer (RT-DETR) have emerged as powerful alternatives. RT-DETR combines the speed of CNN backbones with the precision of Transformer decoding heads.
However, pure Transformer models can be computationally heavy. For many edge applications, highly optimized hybrid models like YOLO26—which integrate efficient attention mechanisms with rapid convolutional processing—offer a superior balance of speed and accuracy. You can manage the training and deployment of these models easily via the Ultralytics Platform, which streamlines the workflow from dataset annotation to model export.
The following example demonstrates how to perform inference using a Transformer-based model within the `ultralytics` package. This code loads a pre-trained RT-DETR model and detects objects in an image.

```python
from ultralytics import RTDETR

# Load a pre-trained Real-Time Detection Transformer (RT-DETR) model
model = RTDETR("rtdetr-l.pt")

# Run inference on an image URL
# The model uses self-attention to identify objects with high accuracy
results = model("https://ultralytics.com/images/bus.jpg")

# Display the detection results with bounding boxes
results[0].show()
```
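For comparison, the hybrid models mentioned above follow the same API. The sketch below assumes a YOLO26 nano checkpoint named `yolo26n.pt` is available; the exact filename may differ.

```python
from ultralytics import YOLO

# Load a hybrid convolutional/attention detector (checkpoint name is an assumption)
model = YOLO("yolo26n.pt")

# Run inference on the same image for a side-by-side comparison
results = model("https://ultralytics.com/images/bus.jpg")
results[0].show()
```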
For further reading on the mathematical foundations, the PyTorch documentation on Transformer layers provides technical depth, while IBM's guide to Transformers offers a high-level business perspective.