
Byte-Pair Encoding (BPE)

A subword tokenization algorithm that iteratively merges the most frequent pair of adjacent symbols, building up a vocabulary of subword units.

Byte-Pair Encoding (BPE) is the tokenisation algorithm that underpins almost every modern language model. Starting from a vocabulary of individual bytes or characters, BPE repeatedly finds the most frequent adjacent pair in the training corpus and merges them into a new token, until the vocabulary reaches a target size (typically 32K–200K tokens for modern LLMs). The result is a vocabulary that handles common words as single tokens, rare words as a few subword pieces, and completely novel strings by falling back to character-level tokens.
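The merge loop described above can be sketched in a few lines. This is a minimal training sketch, not a production implementation: the corpus, word list, and merge count are illustrative, and real tokenisers add byte-level fallback, pre-tokenisation, and frequency caching.

```python
from collections import Counter

def train_bpe(corpus, num_merges):
    """Learn BPE merges: repeatedly fuse the most frequent adjacent pair."""
    # Each word starts as a tuple of single characters.
    vocab = Counter(tuple(word) for word in corpus)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite every word with the chosen pair fused into one symbol.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges

# Toy corpus; a real run would use millions of words and ~32K-200K merges.
corpus = ["low", "low", "lower", "newest", "newest", "newest", "widest"]
merges = train_bpe(corpus, 3)
```

Each learned merge becomes a new vocabulary entry, so the final vocabulary size is roughly the base alphabet plus the number of merges.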

The design hits a sweet spot for language models. A word-level vocabulary cannot represent unseen words; a character-level vocabulary makes sequences painfully long; BPE gives compact sequences for common text while remaining open-vocabulary. **GPT-family** models use BPE directly. **BERT** uses a close variant called WordPiece. **T5** and **Llama** use **SentencePiece**, a tokeniser library that implements both BPE and a unigram-language-model alternative; the unigram model is trained differently but produces similar-looking subword units. OpenAI's **tiktoken** library and **HuggingFace Tokenizers** are the two most widely used production implementations.
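Whatever the implementation, the encoding side applies the learned merges in order and leaves unmatched symbols as single characters, which is exactly the open-vocabulary fallback described above. A minimal sketch with a hypothetical merge list:

```python
def bpe_encode(word, merges):
    """Segment a word by applying BPE merges in learned order."""
    symbols = list(word)
    for a, b in merges:
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and symbols[i] == a and symbols[i + 1] == b:
                out.append(a + b)  # fuse the matched pair
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        symbols = out
    return symbols

# Hypothetical merges, as if learned from an English corpus.
merges = [("t", "h"), ("th", "e"), ("i", "n"), ("in", "g")]

common = bpe_encode("the", merges)    # -> ["the"]: one token
partial = bpe_encode("thing", merges) # -> ["th", "ing"]: subword pieces
novel = bpe_encode("走", merges)       # -> ["走"]: character-level fallback
```

Common words collapse to a single token, rarer words split into a few pieces, and a symbol the merges never saw simply stays as itself.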

For APAC enterprise teams, the tokeniser decision rarely comes up explicitly; you inherit it from whatever foundation model you choose. But the consequences matter: **CJK text consumes more tokens per character than English** in most BPE vocabularies, so a Chinese or Japanese string of the same semantic content typically uses 1.5–3× more tokens than its English equivalent. This is why API costs for Asian-language workloads run higher than English, and why context-window budgets need to be re-evaluated in local-language terms.
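The token multiplier feeds straight into the bill. A back-of-envelope sketch, where the character volume, tokens-per-character densities, and per-token price are all illustrative assumptions, not measured figures:

```python
def monthly_token_cost(chars_per_month, tokens_per_char, usd_per_1k_tokens):
    """Estimate monthly API spend from character volume and token density."""
    tokens = chars_per_month * tokens_per_char
    return tokens / 1000 * usd_per_1k_tokens

# Hypothetical workload: 100M characters/month at $0.002 per 1K tokens.
english = monthly_token_cost(100_000_000, 0.25, 0.002)  # assume ~4 chars/token
japanese = monthly_token_cost(100_000_000, 0.60, 0.002) # assume denser tokenisation
ratio = japanese / english  # ~2.4x, inside the 1.5-3x range above
```

Measure `tokens_per_char` on your own corpus with the actual model's tokeniser before budgeting; the illustrative densities here only show the shape of the calculation.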

The practical operational note: if your workload is heavily non-English, evaluate models with **language-specific vocabularies** (Qwen for Chinese, PLaMo for Japanese, HyperCLOVA X for Korean). The token-efficiency delta can be 30–50% on cost and latency, which is significant at production volume.
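When shortlisting models, the comparison above reduces to one number per candidate: tokens per character on a representative local-language sample. A sketch with a pluggable tokeniser callable; the whitespace splitter here is a stand-in, not any real model's tokeniser:

```python
def tokens_per_char(tokenize, samples):
    """Average token density across sample texts.

    `tokenize` is any callable text -> list of tokens; plug in each
    candidate model's tokeniser and compare on the same corpus.
    """
    total_tokens = sum(len(tokenize(s)) for s in samples)
    total_chars = sum(len(s) for s in samples)
    return total_tokens / total_chars

# Stand-in tokeniser for illustration only: whitespace split.
samples = ["the quick brown fox", "jumps over the lazy dog"]
density = tokens_per_char(str.split, samples)
```

Run this with each candidate tokeniser over the same Japanese, Chinese, or Korean sample set; a 30–50% density gap at production volume justifies the evaluation effort.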
