Learn how tokens, the building blocks of AI models, power NLP, computer vision, and tasks like sentiment analysis and object detection.
In the landscape of artificial intelligence, a token serves as the fundamental, atomic unit of information that a machine learning model processes. Before a neural network can analyze a sentence, a code snippet, or even an image, the raw data must be segmented into these discrete, manageable pieces through tokenization, a critical step in data preprocessing. While humans perceive language as a stream of words or images as a continuous scene, algorithms require these inputs to be broken down into standardized elements to perform calculations efficiently.
To understand how modern deep learning systems function, it is essential to distinguish between the unit of data and the process that creates it: a token is the "what" (the discrete piece of data the model operates on), while tokenization is the "how" (the algorithm that converts raw input into those pieces).
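A minimal sketch makes the distinction concrete. The snippet below uses naive whitespace splitting purely for illustration; production models use subword algorithms rather than `str.split`.

```python
# Tokenization (the process) turns raw text into tokens (the units).
# Whitespace splitting is the simplest possible tokenizer, used here
# only to illustrate the unit-vs-process distinction.
text = "Tokens power modern AI"
tokens = text.split()
print(tokens)  # ['Tokens', 'power', 'modern', 'AI']
```

Each element of the resulting list is a token; the call that produced the list is the tokenization step.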
Once data is tokenized, the resulting tokens are not used directly as text strings or image patches. Instead, they are mapped to numerical vectors known as embeddings. These high-dimensional vectors capture the semantic meaning and relationships between tokens, allowing frameworks like PyTorch to perform mathematical operations on them.
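The mapping from token IDs to vectors can be sketched with PyTorch's `nn.Embedding` layer. The toy vocabulary and the embedding dimension of 8 below are illustrative choices, not values from any particular model.

```python
import torch
import torch.nn as nn

# Toy vocabulary mapping each token string to an integer ID
vocab = {"<pad>": 0, "deep": 1, "learning": 2, "models": 3, "use": 4, "tokens": 5}

# Embedding layer: a learnable lookup table from token IDs to dense vectors
embedding = nn.Embedding(num_embeddings=len(vocab), embedding_dim=8)

sentence = ["deep", "learning", "models", "use", "tokens"]
token_ids = torch.tensor([vocab[t] for t in sentence])

# Each of the 5 tokens becomes an 8-dimensional vector
vectors = embedding(token_ids)
print(vectors.shape)  # torch.Size([5, 8])
```

During training, gradient descent adjusts these vectors so that tokens appearing in similar contexts end up close together in the embedding space.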
In Natural Language Processing (NLP), tokens are the inputs for Large Language Models (LLMs) like the GPT series. Modern models typically use subword tokenization algorithms, such as Byte Pair Encoding (BPE). This method balances efficiency and vocabulary size by keeping common words as single tokens while breaking rare words into meaningful subword units.
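The core of BPE is simple: repeatedly find the most frequent adjacent pair of tokens and merge it into one. The sketch below starts from individual characters and performs a few merge rounds on a tiny corpus; a real tokenizer would learn thousands of merges from a large dataset.

```python
from collections import Counter


def most_frequent_pair(tokens):
    """Return the adjacent token pair that occurs most often."""
    pairs = Counter(zip(tokens, tokens[1:]))
    return max(pairs, key=pairs.get)


def merge_pair(tokens, pair):
    """Replace every occurrence of `pair` with a single merged token."""
    merged, i = [], 0
    while i < len(tokens):
        if i < len(tokens) - 1 and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


# Start from individual characters of a toy corpus
tokens = list("low lower lowest")
for _ in range(3):
    tokens = merge_pair(tokens, most_frequent_pair(tokens))

print(tokens)  # the common stem "low" has been merged into a single token
```

After a few merges, the frequent stem "low" becomes one token while the rarer suffixes remain split, which is exactly the common-words-whole, rare-words-in-pieces behavior described above.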
The concept of tokens has revolutionized image analysis through architectures like the Vision Transformer (ViT). Instead of processing pixels via convolution, these models divide an image into a grid of fixed-size patches (e.g., 16x16 pixels). Each patch is flattened and treated as a "visual token," enabling the use of powerful Transformer mechanisms like self-attention to understand global context within an image.
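Patch extraction is just an array reshape. The NumPy sketch below splits a dummy 224x224 RGB image (a common ViT input size, assumed here for illustration) into 16x16 patches and flattens each into a vector.

```python
import numpy as np

# Dummy RGB image: 224x224 pixels, 3 channels
image = np.random.rand(224, 224, 3)

patch = 16
h, w, c = image.shape

# Split the image into a 14x14 grid of 16x16 patches,
# then flatten each patch into a single vector ("visual token")
patches = image.reshape(h // patch, patch, w // patch, patch, c)
patches = patches.transpose(0, 2, 1, 3, 4).reshape(-1, patch * patch * c)

print(patches.shape)  # (196, 768): 196 visual tokens, each of dimension 768
```

The resulting sequence of 196 vectors is what the Transformer's self-attention layers consume, just as they would a sequence of word embeddings.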
Tokens are the building blocks for some of the most advanced capabilities in AI today.
The following example demonstrates how the ultralytics package leverages tokens behind the scenes. By
providing a list of text classes, the model tokenizes these inputs to identify specific objects in an image
dynamically.
from ultralytics import YOLO
# Load a YOLO-World model capable of understanding text tokens
model = YOLO("yolo11s-world.pt")
# Define custom classes (these are tokenized internally)
model.set_classes(["helmet", "vest"])
# Run prediction; the model matches visual features to the text tokens
results = model.predict("https://ultralytics.com/images/bus.jpg")
# Show results
results[0].show()
Understanding tokens is pivotal for grasping how foundation models bridge the gap between unstructured human data and computational understanding, whether for image classification or complex language tasks.