What is a Tokenizer?
TL;DR
The component that converts text into a sequence of tokens that an AI model can process.
Tokenizer: Definition & Explanation
A tokenizer is the component that converts input text into a sequence of tokens, the smallest units an AI model can process. LLMs do not operate on raw text directly; the text is first broken into tokens. Common tokenization methods include BPE (Byte-Pair Encoding), WordPiece, and SentencePiece. In English, tokens typically correspond to subwords (parts of words), while languages such as Japanese and Chinese may require 1 to 3 tokens per character, making them less token-efficient.

Tokenizer design significantly affects model performance, and handling diverse languages efficiently is a particular challenge. Companies such as OpenAI (tiktoken) and Google (SentencePiece) have developed and released their own tokenizers. Because API pricing and context window limits are both measured in tokens, understanding how tokenizers work is important for using AI effectively.
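To make the BPE idea concrete, here is a minimal training sketch in pure Python. It is an illustration of the merge procedure only, not the implementation used by tiktoken or SentencePiece; the toy corpus, function name, and merge count are invented for the example. BPE starts from individual characters and repeatedly merges the most frequent adjacent pair into a new symbol, so frequent fragments like "er" become single tokens:

```python
from collections import Counter

def bpe_train(words, num_merges):
    """Learn BPE merge rules from a toy corpus.

    words: dict mapping each word (as a tuple of symbols) to its frequency.
    Returns (merges, vocab): the learned merge rules and the re-segmented corpus.
    """
    vocab = dict(words)
    merges = []
    for _ in range(num_merges):
        # Count frequencies of adjacent symbol pairs across the corpus.
        pairs = Counter()
        for word, freq in vocab.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        merges.append(best)
        # Replace every occurrence of that pair with one merged symbol.
        new_vocab = {}
        for word, freq in vocab.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1])
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            new_vocab[tuple(out)] = freq
        vocab = new_vocab
    return merges, vocab

# Toy corpus: word frequencies, each word initially split into characters.
corpus = {tuple("lower"): 5, tuple("lowest"): 2, tuple("newer"): 6, tuple("wider"): 3}
merges, vocab = bpe_train(corpus, num_merges=4)
# The pair ('e', 'r') is the most frequent, so it is merged first,
# and "lower" ends up segmented as the subwords ("lo", "wer").
```

Real tokenizers add details this sketch omits (byte-level fallback for arbitrary input, special tokens, tie-breaking rules), but the core loop — count pairs, merge the most frequent — is the same.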