
Tokenization

The process of converting raw text into tokens — discrete units that a language model can consume. Dominated by subword methods (BPE, WordPiece, SentencePiece).

Tokenization is the process of converting raw text into tokens — discrete units that a language model can consume as input. The task sits at the boundary between the messy, infinite variety of human text and the fixed-vocabulary world the model lives in. The choice of tokenizer defines the vocabulary size, the typical sequence length, how rare words are handled, and — for multilingual models — how evenly different languages are represented.

The dominant approach in 2026 is **subword tokenization**, which strikes a balance between word-level (too many out-of-vocabulary words) and character-level (sequences too long) by breaking text into frequent character substrings. The widely deployed variants are **Byte-Pair Encoding (BPE)**, which iteratively merges the most frequent adjacent symbol pair and is used by the GPT, Llama, and Mistral families; **WordPiece**, BERT's variant, which selects merges by likelihood gain rather than raw frequency; and the **unigram language model**, which prunes a large candidate vocabulary down to the subwords that best explain the corpus, typically applied through the **SentencePiece** library (T5, Gemma). Byte-level BPE (GPT-2 onwards) operates on raw UTF-8 bytes rather than characters, which elegantly handles any Unicode input, including emoji, rare CJK characters, and code, without special casing.
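The merge loop at the heart of BPE is compact enough to sketch in full. The following is a minimal illustration (function names are ours, not from any library): training counts adjacent symbol pairs across a word-frequency table and repeatedly fuses the most frequent pair; encoding then replays those merges in order on new text.

```python
from collections import Counter

def bpe_train(corpus, num_merges):
    """Learn BPE merges: repeatedly fuse the most frequent adjacent symbol pair."""
    # Represent each word as a tuple of symbols, weighted by its frequency.
    words = Counter(tuple(w) for w in corpus.split())
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for word, freq in words.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite every word with the new merged symbol.
        merged = Counter()
        for word, freq in words.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1])
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            merged[tuple(out)] += freq
        words = merged
    return merges

def bpe_encode(word, merges):
    """Segment a word by replaying the learned merges in training order."""
    symbols = list(word)
    for a, b in merges:
        i = 0
        while i < len(symbols) - 1:
            if symbols[i] == a and symbols[i + 1] == b:
                symbols[i:i + 2] = [a + b]
            else:
                i += 1
    return symbols

merges = bpe_train("low low low lower lowest", 3)
print(merges)                        # [('l', 'o'), ('lo', 'w'), ('low', 'e')]
print(bpe_encode("lowest", merges))  # ['lowe', 's', 't']
```

Production tokenizers add pre-tokenization rules, byte fallback, and heavily optimized merge tables, but the segmentation logic is the same.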

For APAC mid-market teams, the tokenizer rarely becomes an explicit decision; you inherit whichever one your chosen foundation model uses. But the consequences matter for cost and latency. **CJK languages tokenize more densely than English** in most BPE vocabularies trained primarily on English corpora: a Chinese or Japanese string with the same semantic content typically consumes 1.5–3× more tokens than its English equivalent. This is why the same "context length" or per-call cost feels very different across languages, and why regional models with language-specific vocabularies (Qwen, PLaMo, HyperCLOVA X) are materially cheaper for the same workload than English-centric models.
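One mechanical driver of this gap is easy to demonstrate. In byte-level BPE, a character costs as many tokens as its UTF-8 bytes until the vocabulary learns merges covering it, and CJK characters are 3 bytes each. Actual ratios depend on the merges a given tokenizer has learned; this sketch only shows the no-merge floor.

```python
def utf8_bytes(text: str) -> int:
    """Worst-case token count for a byte-level tokenizer with no learned merges."""
    return len(text.encode("utf-8"))

english = "Hello, world"  # 12 characters
chinese = "你好，世界"      # 5 characters carrying a comparable greeting

print(utf8_bytes(english))  # 12 -> 1 byte per character
print(utf8_bytes(chinese))  # 15 -> 3 bytes per character
```

An English-heavy merge table collapses most of those English bytes into a handful of tokens while leaving CJK text much closer to this floor, which is the density gap the paragraph above describes.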

The non-obvious operational note: **tokenizers are versioned and not always compatible across model upgrades**. A fine-tuning dataset tokenized with an older tokenizer may fail or degrade silently when run through a newer one. When upgrading a model, re-tokenize your evaluation and training data with the new tokenizer; do not assume token-by-token equivalence.
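A cheap guardrail for this is a drift check run before any upgrade: tokenize a sample of your evaluation set with both versions and flag every text whose token sequence changed. The function below is a generic sketch; `tokenize_old` and `tokenize_new` are hypothetical callables standing in for whatever tokenizer API your stack exposes, and the toy tokenizers exist only to exercise it.

```python
def find_tokenization_drift(samples, tokenize_old, tokenize_new):
    """Return the samples whose token sequences differ between two tokenizer versions."""
    return [s for s in samples if tokenize_old(s) != tokenize_new(s)]

# Toy stand-ins: "v2" adds a merge for "er", changing how some words segment.
def tok_v1(s):
    return list(s)  # character-level baseline

def tok_v2(s):
    out, i = [], 0
    while i < len(s):
        if s[i:i + 2] == "er":
            out.append("er")
            i += 2
        else:
            out.append(s[i])
            i += 1
    return out

print(find_tokenization_drift(["cat", "tiger"], tok_v1, tok_v2))  # ['tiger']
```

Any non-empty result means token-by-token equivalence does not hold, and cached token IDs or pre-tokenized training shards from the old version should be regenerated.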
