Glossary

Tokenization

Discover the power of tokenization in NLP and ML! Learn how breaking text into tokens enhances AI tasks like sentiment analysis and text generation.

Tokenization is a fundamental preprocessing step in Artificial Intelligence (AI) and Machine Learning (ML), particularly vital in Natural Language Processing (NLP). It involves breaking down sequences of text or other data into smaller, manageable units called tokens. These tokens serve as the basic building blocks that algorithms use to understand and process information, transforming raw input like sentences or paragraphs into a format suitable for analysis by machine learning models. This process is essential because computers don't understand text in the same way humans do; they need data structured into discrete pieces.

How Tokenization Works

The core idea behind tokenization is segmentation. For text data, this typically means splitting sentences into words, subwords, or even individual characters based on predefined rules or learned patterns. For example, the sentence "Ultralytics YOLO11 is powerful" might be tokenized into individual words: ["Ultralytics", "YOLO11", "is", "powerful"]. The specific method chosen depends heavily on the task and the model architecture being used.
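
As a minimal illustration, the whitespace-and-punctuation splitting described above can be sketched in a few lines of Python (the regular expression here is an illustrative rule set, not a production tokenizer):

```python
import re


def simple_tokenize(text: str) -> list[str]:
    """Split text into word tokens on whitespace and punctuation (illustrative rules only)."""
    return re.findall(r"\w+|[^\w\s]", text)


print(simple_tokenize("Ultralytics YOLO11 is powerful"))
# ['Ultralytics', 'YOLO11', 'is', 'powerful']
```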

Common techniques include splitting text based on whitespace and punctuation. However, more advanced methods are often necessary, especially for handling large vocabularies or words not seen during training. Techniques like Byte Pair Encoding (BPE) or WordPiece break words into smaller subword units. These are frequently used in Large Language Models (LLMs) like BERT and GPT-4 to manage vocabulary size effectively and handle unknown words gracefully. The choice of tokenization strategy can significantly impact model performance and computational efficiency.
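
As a rough sketch of subword tokenization in practice, the snippet below uses the Hugging Face Transformers library (a third-party package, not part of Ultralytics) to apply BERT's WordPiece tokenizer; the exact subword splits depend on the learned vocabulary:

```python
from transformers import AutoTokenizer

# Load the WordPiece tokenizer that ships with the bert-base-uncased checkpoint.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

tokens = tokenizer.tokenize("Tokenization handles rare or unseen words gracefully")
print(tokens)  # Rare words are split into subwords, e.g. 'tokenization' -> ['token', '##ization']

# Each token is mapped to an integer ID from the model's fixed-size vocabulary.
print(tokenizer.convert_tokens_to_ids(tokens))
```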

Relevance and Real-World Applications

Tokenization is crucial because most ML models, especially deep learning architectures, require numerical input rather than raw text. By converting text into discrete tokens, we can then map these tokens to numerical representations, such as embeddings. These numerical vectors capture semantic meaning and relationships, allowing models built with frameworks like PyTorch or TensorFlow to learn patterns from the data. This foundational step underpins numerous AI applications:
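
To make the token-to-number mapping concrete, here is a minimal PyTorch sketch with a hypothetical four-token vocabulary; an `nn.Embedding` layer turns each token ID into a dense vector that a model can learn from:

```python
import torch
import torch.nn as nn

# Hypothetical toy vocabulary mapping each token to an integer ID.
vocab = {"ultralytics": 0, "yolo11": 1, "is": 2, "powerful": 3}
token_ids = torch.tensor([[vocab[t] for t in ["ultralytics", "yolo11", "is", "powerful"]]])

# The embedding layer maps each ID to a learnable dense vector (the numerical representation).
embedding = nn.Embedding(num_embeddings=len(vocab), embedding_dim=8)
vectors = embedding(token_ids)
print(vectors.shape)  # torch.Size([1, 4, 8]) -> (batch, tokens, embedding_dim)
```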

  1. Natural Language Processing (NLP): Tokenization is central to almost all NLP tasks.

    • Machine Translation: Services like Google Translate tokenize the input sentence in the source language, process these tokens using complex models (often based on the Transformer architecture), and then generate tokens in the target language, which are finally assembled into the translated sentence.
    • Sentiment Analysis: To determine if a customer review is positive or negative, the text is first tokenized. The model then analyzes these tokens (and their numerical representations) to classify the overall sentiment. Learn more about Sentiment Analysis. Techniques like prompt tuning also rely on manipulating token sequences.
  2. Computer Vision (CV): Although tokenization is traditionally associated with NLP, the concept extends to Computer Vision (CV) as well.

    • Vision Transformers (ViT): In a Vision Transformer (ViT), an image is divided into fixed-size patches. These patches are treated as 'visual tokens' and flattened into sequences, which are then fed into a Transformer network that uses mechanisms like self-attention to understand relationships between different image parts, much as text tokens are processed in NLP (see the patch-extraction sketch after this list). This enables tasks like image classification and object detection. Models like the Segment Anything Model (SAM) also use token-like concepts for image segmentation.
    • Multimodal Models: Models like CLIP and YOLO-World bridge vision and language by processing both text tokens and visual tokens (or image features) to perform tasks like zero-shot object detection based on text descriptions.
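
To show what 'visual tokens' look like in practice, the sketch below splits a dummy 224x224 image into 16x16 patches with PyTorch; the sizes are illustrative assumptions, and a real ViT would additionally project each flattened patch through a linear layer and add positional embeddings before the Transformer encoder:

```python
import torch

# A dummy batch with one 3-channel 224x224 image (random values, for illustration only).
images = torch.randn(1, 3, 224, 224)
patch_size = 16

# Extract non-overlapping 16x16 patches along the height and width dimensions.
patches = images.unfold(2, patch_size, patch_size).unfold(3, patch_size, patch_size)

# Rearrange into a sequence of flattened patches: (batch, num_patches, patch_dim).
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(1, -1, 3 * patch_size * patch_size)
print(patches.shape)  # torch.Size([1, 196, 768]) -> 196 visual tokens, each of dimension 768
```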

Tokenization vs. Tokens

It's important to distinguish between 'Tokenization' and a 'Token'.

  • Tokenization: Refers to the process of breaking down data into smaller units. It's a preprocessing step.
  • Token: Refers to the result of the tokenization process – the individual unit (word, subword, character, or image patch) that the model processes.

Understanding tokenization is fundamental to grasping how AI models interpret and learn from diverse data types. Managing datasets and training models often involves platforms like Ultralytics HUB, which streamline data preprocessing and model training workflows that handle tokenized data implicitly or explicitly. As AI evolves, tokenization methods continue to adapt, playing a key role in building more sophisticated models for tasks ranging from text generation to complex visual understanding in fields like autonomous vehicles and medical image analysis.
