
Token

Learn how tokens, the building blocks of AI models, power NLP, computer vision, and tasks like sentiment analysis and object detection.

In the landscape of artificial intelligence, a token serves as the fundamental, atomic unit of information that a machine learning model processes. Before a neural network can analyze a sentence, a code snippet, or even an image, the raw data must be segmented into these discrete, manageable pieces through a critical step in data preprocessing. While humans perceive language as a stream of words or images as a continuous scene, algorithms require these inputs to be broken down into standardized elements to perform calculations efficiently.

Token vs. Tokenization

To understand how modern deep learning systems function, it is essential to distinguish between the unit of data and the process that creates it. This distinction is often clarified by comparing the "what" with the "how."

  • Token: This is the output—the actual chunk of data fed into the model. In text processing, a token might represent a whole word, part of a word (subword), or a single character. In computer vision, it often represents a specific patch of pixels.
  • Tokenization: This is the algorithmic process of splitting the raw data into tokens. For example, specialized tools in libraries like spaCy or NLTK handle the rules for where one token ends and the next begins.
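To make the distinction concrete, here is a minimal sketch of tokenization at three granularities. The functions and the tiny subword vocabulary are illustrative inventions; production tokenizers in libraries like spaCy or NLTK apply far richer rules.

```python
def word_tokens(text):
    """Split on whitespace: one token per word."""
    return text.split()


def char_tokens(text):
    """One token per character (spaces dropped)."""
    return [c for c in text if c != " "]


def subword_tokens(word, vocab):
    """Greedy longest-match subword split against a known vocabulary."""
    pieces, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:
            pieces.append(word[i])  # unknown piece: fall back to one character
            i += 1
    return pieces


print(word_tokens("tokens power AI"))  # ['tokens', 'power', 'AI']
print(subword_tokens("unhappiness", {"un", "happi", "ness"}))  # ['un', 'happi', 'ness']
```

The subword split is why a model can handle a word it has never seen whole: the word still decomposes into familiar pieces.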

The Role of Tokens in AI Architectures

Once data is tokenized, the resulting tokens are not used directly as text strings or image patches. Instead, they are mapped to numerical vectors known as embeddings. These high-dimensional vectors capture the semantic meaning and relationships between tokens, allowing frameworks like PyTorch to perform mathematical operations on them.
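The token-to-vector lookup can be sketched in a few lines. The four-entry vocabulary and 4-dimensional embeddings below are made up for clarity; real models use vocabularies of tens of thousands of tokens and hundreds of dimensions.

```python
import numpy as np

# Hypothetical vocabulary mapping tokens to integer ids
vocab = {"<unk>": 0, "tokens": 1, "power": 2, "AI": 3}

# One embedding row per token id (randomly initialized here;
# in a trained model these rows are learned)
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(len(vocab), 4))


def embed(tokens):
    """Map tokens to ids, then look up their embedding vectors."""
    ids = [vocab.get(t, vocab["<unk>"]) for t in tokens]
    return embedding_table[ids]  # shape: (num_tokens, embedding_dim)


vectors = embed(["tokens", "power", "AI"])
print(vectors.shape)  # (3, 4)
```

Unknown tokens fall back to a shared `<unk>` row, which is one common (if lossy) way to keep the lookup total.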

Text Tokens in NLP

In Natural Language Processing (NLP), tokens are the inputs for Large Language Models (LLMs) like the GPT series. Modern models typically use subword tokenization algorithms, such as Byte Pair Encoding (BPE). This method balances efficiency and vocabulary size by keeping common words as single tokens while breaking rare words into meaningful syllables.
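The core of BPE is simple to sketch: repeatedly find the most frequent adjacent pair of symbols and merge it into a new token. The three-word "corpus" below is a toy stand-in; real BPE training runs over large corpora for thousands of merge steps.

```python
from collections import Counter


def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words; return the most common."""
    pairs = Counter()
    for symbols in words:
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += 1
    return max(pairs, key=pairs.get)


def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged token."""
    merged = []
    for symbols in words:
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged.append(out)
    return merged


# Start from single characters; frequent pairs fuse into subword tokens.
corpus = [list("lower"), list("lowest"), list("low")]
for _ in range(2):
    corpus = merge_pair(corpus, most_frequent_pair(corpus))
print(corpus[0])  # ['low', 'e', 'r']
```

After two merges the shared stem "low" has become a single token, while the rarer suffixes remain split, which is exactly the efficiency/vocabulary-size trade-off described above.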

Visual Tokens in Computer Vision

The concept of tokens has revolutionized image analysis through architectures like the Vision Transformer (ViT). Instead of processing pixels via convolution, these models divide an image into a grid of fixed-size patches (e.g., 16x16 pixels). Each patch is flattened and treated as a "visual token," enabling the use of powerful Transformer mechanisms like self-attention to understand global context within an image.
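The patching step itself is just an array reshape. The sketch below uses an illustrative 64x64 RGB image and 16x16 patches; ViT models additionally project each flattened patch through a learned linear layer, which is omitted here.

```python
import numpy as np

image = np.zeros((64, 64, 3))  # toy RGB image (height, width, channels)
P = 16  # patch size

# Carve the image into a 4x4 grid of 16x16 patches...
patches = image.reshape(64 // P, P, 64 // P, P, 3).transpose(0, 2, 1, 3, 4)

# ...then flatten each patch into one vector: one visual token per patch
tokens = patches.reshape(-1, P * P * 3)

print(tokens.shape)  # (16, 768): 16 visual tokens, each of dimension 768
```

The resulting sequence of 16 vectors is what the Transformer's self-attention layers consume, in the same way they would consume a sequence of text-token embeddings.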

Real-World Applications

Tokens are the building blocks for some of the most advanced capabilities in AI today.

  1. Open-Vocabulary Object Detection: Models like YOLO-World utilize a multi-modal approach where text and image tokens interact. Users can define custom classes (e.g., "blue backpack") as text prompts. The model tokenizes these prompts and matches them against visual tokens in the image to perform zero-shot detection without retraining.
  2. Generative AI and Chatbots: When interacting with a chatbot, the system uses text generation to predict the most probable next token in a sequence. This token-by-token prediction allows for the creation of coherent and contextually relevant responses, driving applications from customer support to code completion.
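The token-by-token prediction loop in point 2 can be sketched with a toy model. Here a hand-written bigram table stands in for an LLM's learned next-token distribution; a real model scores every token in its vocabulary at each step.

```python
# Hypothetical next-token table: each token deterministically picks a successor
next_token = {"<start>": "how", "how": "can", "can": "I", "I": "help", "help": "?"}


def generate(max_tokens=10):
    """Emit tokens one at a time until there is no continuation."""
    tokens, current = [], "<start>"
    for _ in range(max_tokens):
        current = next_token.get(current)
        if current is None:  # no learned continuation: stop generating
            break
        tokens.append(current)
    return tokens


print(" ".join(generate()))  # how can I help ?
```

Swapping the lookup table for a neural network that outputs a probability over the vocabulary, and sampling from that distribution instead of following a fixed successor, yields the generation loop used by chatbots.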

Example: Using Text Tokens for Detection

The following example demonstrates how the ultralytics package leverages tokens behind the scenes. By providing a list of text classes, the model tokenizes these inputs to identify specific objects in an image dynamically.

from ultralytics import YOLO

# Load a YOLO-World model capable of understanding text tokens
model = YOLO("yolo11s-world.pt")

# Define custom classes (these are tokenized internally)
model.set_classes(["helmet", "vest"])

# Run prediction; the model matches visual features to the text tokens
results = model.predict("https://ultralytics.com/images/bus.jpg")

# Show results
results[0].show()

Understanding tokens is pivotal for grasping how foundation models bridge the gap between unstructured human data and computational understanding, whether for image classification or complex language tasks.
