Clinic 14

Tokenizer Choice

Three tokenizers are on the table — BPE, WordPiece, character — and a weak-slice report you did not expect. The macro F1 is fine; the error bucket is not. Pick a tokenizer and defend it.


Situation

Macro F1 Fine, One Slice Broken

A language-ID classifier hits 0.93 macro F1. The code-mixed slice sits at 0.41, and the Swahili and Telugu slices are not far ahead of it. You have one tokenizer swap to try.

Your Job

Pick A Tokenizer

Choose BPE, WordPiece, or character, and name the inspection that decides whether the swap paid off.

Bad Habit To Avoid

Folklore Over Evidence

"Everyone uses BPE" is not a defense. Tokenizer choice is a property of the data, not of fashion.

Situation

You are building a text classifier for a multilingual product-support queue. The task is: given an incoming message, predict which of 40 languages (or "mixed") it is in, so the message is routed to the right specialist team.

Current state:

training set: 800k messages labeled by language
model: a small transformer, 4 layers, 256 hidden
current tokenizer: a BPE tokenizer with 32k vocab, trained on a mix of web corpora
macro F1 on held-out test set: 0.93 — solid headline number
per-language slice analysis reveals: three slices sit far below the headline number
Swahili (low resource in pretraining mix): F1 0.58
Telugu (Indic script under-represented in vocab): F1 0.49
code-mixed (English + Hindi written in Latin script): F1 0.41

The product team cares about the three worst slices — routing a Swahili ticket to an English specialist is a visible failure. Macro F1 of 0.93 buys no goodwill if Telugu routing is broken.

You can do one thing before the next release: retrain with a different tokenizer. Three replacement candidates are on the table, listed below alongside the current BPE baseline.

Artifact Packet

tokenizer	training corpus	vocab size	strengths	risks
BPE (current)	English-heavy web mix	32k	fast, reasonable on Latin scripts	rare-script words fragment into single bytes
WordPiece	same mix, re-trained	32k	similar size, slightly different split rule	same low-resource weakness as BPE
Character-level	none needed	~1000	zero vocab problems, language-agnostic	much longer sequences, more compute per sample
SentencePiece (BPE variant) retrained on multilingual balanced corpus	balanced 40-language	64k	preserves rare-language morphology	training cost + bigger embedding table
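
The "bigger embedding table" risk in the last row can be sized with simple arithmetic. The sketch below multiplies each vocab size from the table by the 256-dimensional hidden size of the current model. It assumes "32k" and "64k" mean round thousands and that the input embedding is a single vocab-by-hidden matrix stored as float32; the name `embedding_params` is illustrative, not part of any library.

```python
HIDDEN = 256  # hidden size of the clinic's 4-layer transformer

# Vocab sizes from the artifact packet; round thousands are an assumption.
VOCAB_SIZES = {
    "BPE (current)": 32_000,
    "WordPiece": 32_000,
    "Character-level": 1_000,
    "SentencePiece (balanced)": 64_000,
}


def embedding_params(vocab_size: int, hidden: int = HIDDEN) -> int:
    """Parameters in an input embedding table: one hidden-sized vector per vocab entry."""
    return vocab_size * hidden


for name, vocab in VOCAB_SIZES.items():
    params = embedding_params(vocab)
    megabytes = params * 4 / 1e6  # 4 bytes per float32 parameter
    print(f"{name:26s} {params:>12,d} params  (~{megabytes:.1f} MB)")
```

Under these assumptions, doubling the vocab from 32k to 64k doubles the embedding table from about 8.2M to about 16.4M parameters, which is the cost the defense has to acknowledge.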

Weak-slice diagnostics on the current model:

Swahili tokens average 6.4 subwords per word (English: 1.3)
Telugu averages 11.7 subwords per word because the script is fragmenting into UTF-8 bytes
code-mixed messages hit untokenizable sequences that collapse to [UNK] for ~8% of tokens

The failure pattern is not architectural — it is tokenizer coverage. The model is being asked to learn Telugu semantics from a stream where a single word is 12 fragmentary tokens.
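
The two diagnostics above, subwords per word and `[UNK]` rate, can be measured with a few lines of code. The following is a minimal sketch, assuming words are whitespace-delimited and that a tokenizer is any callable mapping one word to a list of token strings. The `make_toy_tokenizer` helper and its tiny vocabulary are hypothetical stand-ins; in practice you would pass in each real candidate tokenizer.

```python
from typing import Callable, List

Tokenizer = Callable[[str], List[str]]


def make_toy_tokenizer(vocab: set) -> Tokenizer:
    """Hypothetical greedy longest-match tokenizer; unmatched characters become [UNK]."""
    def tokenize(word: str) -> List[str]:
        pieces, i = [], 0
        while i < len(word):
            for j in range(len(word), i, -1):
                if word[i:j] in vocab:
                    pieces.append(word[i:j])
                    i = j
                    break
            else:  # no vocab entry starts at position i
                pieces.append("[UNK]")
                i += 1
        return pieces
    return tokenize


def subwords_per_word(texts: List[str], tokenize: Tokenizer) -> float:
    """Average number of tokens produced per whitespace-delimited word."""
    words = [w for t in texts for w in t.split()]
    if not words:
        return 0.0
    return sum(len(tokenize(w)) for w in words) / len(words)


def unk_rate(texts: List[str], tokenize: Tokenizer, unk: str = "[UNK]") -> float:
    """Fraction of all produced tokens that are the unknown token."""
    tokens = [tok for t in texts for w in t.split() for tok in tokenize(w)]
    if not tokens:
        return 0.0
    return sum(tok == unk for tok in tokens) / len(tokens)


# Toy vocabulary that covers English words but only a few single letters.
toy = make_toy_tokenizer({"the", "order", "is", "late", "a", "e", "i", "o"})
slices = {
    "english": ["the order is late"],
    "code-mixed": ["mera order late hai"],
}
for name, texts in slices.items():
    print(name, f"{subwords_per_word(texts, toy):.2f}", f"{unk_rate(texts, toy):.2f}")
```

Run per slice rather than over the whole test set: a healthy corpus-wide average can hide exactly the fragmentation this clinic is about.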

Decision Prompt

Write a six-sentence defense that answers:

Which tokenizer do you choose, and for which slice?
What does the subword-per-word distribution look like on your chosen slice, and why does that matter?
What does the inference-cost change look like on typical inputs?
What single inspection will tell you whether the swap helped the weak slices without breaking the strong ones?
What would make you abandon the choice and try another?
Which of the three "good" slices is most at risk of regressing, and why?

Strong Reasoning Looks Like

it picks the retrained multilingual SentencePiece / BPE, not the character model — the character model is a safe fallback but pays too much in sequence length
it reports expected subword-per-word drops on the weak slices (Telugu from ~12 subwords per word to ~2–3; code-mixed [UNK] rate toward zero)
it is honest about the cost: bigger vocab → bigger embedding table → slower training, slightly larger model
it names the inspection — per-slice F1 AND subword-per-word AND [UNK] rate — not just the macro number
it commits to regression-testing the strong slices (English, Spanish, French) to catch overfit to the weak ones
it treats character-level as the safety-net choice for languages where even a balanced BPE still fragments (rare Indic scripts, Amharic, Burmese)
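
The per-slice F1 inspection named above can be sketched without any library. The code assumes each slice is a language label, so per-slice F1 is the one-vs-rest F1 for that label and macro F1 is the unweighted mean over labels. The label lists are made-up toy data, not results from the clinic.

```python
from collections import Counter
from typing import Dict, List


def per_class_f1(y_true: List[str], y_pred: List[str]) -> Dict[str, float]:
    """One-vs-rest F1 for every label seen in the gold or predicted labels."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for gold, pred in zip(y_true, y_pred):
        if gold == pred:
            tp[gold] += 1
        else:
            fp[pred] += 1  # predicted label gains a false positive
            fn[gold] += 1  # gold label gains a false negative
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores[label] = 2 * tp[label] / denom if denom else 0.0
    return scores


def macro_f1(scores: Dict[str, float]) -> float:
    """Unweighted mean of the per-class F1 scores."""
    return sum(scores.values()) / len(scores)


# Toy labels for illustration only: one Telugu message is misrouted to English.
y_true = ["en", "en", "en", "te", "te", "sw"]
y_pred = ["en", "en", "en", "en", "te", "sw"]
scores = per_class_f1(y_true, y_pred)
for label, f1 in scores.items():
    print(f"{label}: {f1:.2f}")
print(f"macro: {macro_f1(scores):.2f}")
```

Printing the full table before and after the swap, side by side, makes a strong-slice regression visible even when the macro number improves.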

Common Wrong Moves

picking BPE because "it's what transformers use" — the current BPE is the problem
jumping to character-level without measuring the sequence-length blowup on production inputs
retraining a 64k SentencePiece tokenizer without checking whether your training data has enough balanced text for each script
reporting only macro F1 after the swap and declaring success — the point was the weak slices, and macro F1 can rise while the target slice falls
treating a nonzero [UNK] rate as acceptable on any slice: an unknown token is a silent zero-signal input
swapping the tokenizer and forgetting that the embedding table has to be retrained from scratch (or carefully warm-started)
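
Measuring the sequence-length blowup before committing to a character model takes little effort. The sketch below compares token counts per message under two tokenizers and reports the ratio of medians. `whitespace_tokenize` is only a stand-in for a real subword tokenizer, the messages are invented, and the p95 uses a simple index rule rather than an interpolated percentile.

```python
from statistics import median
from typing import Callable, Dict, List

Tokenizer = Callable[[str], List[str]]


def char_tokenize(text: str) -> List[str]:
    """Character-level tokenization: one token per character, spaces included."""
    return list(text)


def whitespace_tokenize(text: str) -> List[str]:
    """Stand-in for a subword tokenizer (assumption); swap in the real one."""
    return text.split()


def length_stats(messages: List[str], tokenize: Tokenizer) -> Dict[str, float]:
    """Median, approximate p95, and max sequence length over non-empty input."""
    lengths = sorted(len(tokenize(m)) for m in messages)
    p95_index = min(len(lengths) - 1, int(0.95 * len(lengths)))
    return {"median": median(lengths), "p95": lengths[p95_index], "max": lengths[-1]}


def blowup(messages: List[str], candidate: Tokenizer, baseline: Tokenizer) -> float:
    """Ratio of median sequence lengths, candidate over baseline."""
    return length_stats(messages, candidate)["median"] / length_stats(messages, baseline)["median"]


# Invented sample; use real production messages for the actual decision.
sample = ["ab cd", "abc def ghi"]
print(length_stats(sample, char_tokenize))
print(f"median blowup: {blowup(sample, char_tokenize, whitespace_tokenize):.1f}x")
```

Because self-attention cost grows roughly quadratically with sequence length, the p95 and max columns usually matter more for latency than the median does.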

Run The Clinic In Browser

Use the runner to measure subword-per-word and [UNK] rates on a tiny multilingual sample with each tokenizer.

Reference Reveal

Open only after you write the defense

The reference choice is **retrained SentencePiece-BPE on a balanced 40-language corpus, 64k vocab**.

The reasoning:

- the current failure is vocabulary coverage, not model capacity: the model never sees intact Telugu or Swahili morphemes, so no amount of training will recover them
- a character tokenizer would also fix the coverage problem but pays 4–10× in sequence length, which hurts latency and training cost; for a 4-layer transformer over support messages (median ~60 words), the length blowup is real
- a retrained WordPiece has the same weakness as the current BPE when the training corpus is English-heavy; the fix is not the splitting algorithm, it is the corpus balance
- **64k** instead of 32k vocab is the right trade: it captures rare-language morphology without making the embedding table gratuitous

Expected outcomes per slice (rough, task-dependent):

| slice | old F1 | expected F1 | why |
| --- | --- | --- | --- |
| English | 0.96 | 0.94–0.96 | small regression possible from split shift; unlikely if corpus is balanced |
| Swahili | 0.58 | 0.80+ | intact morphology instead of byte fragments |
| Telugu | 0.49 | 0.78+ | script now tokenized as Telugu, not UTF-8 bytes |
| code-mixed | 0.41 | 0.65+ | `[UNK]` rate drops; Hindi-in-Latin benefits from Hindi tokens |
| macro F1 | 0.93 | 0.93–0.95 | rises or holds: the weak slices pull up, the strong slices hold |

Inspections to run after the swap:

- subword-per-word distribution per language (the direct cause of the prior failure)
- `[UNK]` rate per slice (should be near-zero on all)
- per-slice F1 table, both weak and strong slices (macro alone is misleading)
- wall-clock time per batch on the same hardware (guard against an unnoticed 2× slowdown)

Abandon if: Swahili/Telugu fail to move by more than 10 F1 points (suggests the problem was not tokenizer coverage after all), or if English/Spanish regress by more than 2 F1 points without a compensating slice gain.

The practical lesson: **tokenizer choice is a property of your data, not of the model family**. A tokenizer trained on the wrong corpus poisons every downstream model choice.
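
The abandon criteria can be written down as a check that runs on the before and after per-slice tables. This is one possible reading, not a definitive rule: it treats "10 F1 points" as 0.10 on a 0 to 1 scale and "2 F1 points" as 0.02, and it flags every strong-slice regression for review rather than trying to judge whether a gain elsewhere compensates for it. The function name `abandon_reasons` and the "new" scores in the example are hypothetical.

```python
from typing import Dict, List


def abandon_reasons(
    old: Dict[str, float],
    new: Dict[str, float],
    weak: List[str],
    strong: List[str],
    min_gain: float = 0.10,     # weak slices must improve by more than this
    max_regress: float = 0.02,  # strong slices may not drop by more than this
) -> List[str]:
    """Return the reasons to abandon the swap; an empty list means keep it."""
    reasons = []
    for name in weak:
        gain = round(new[name] - old[name], 4)  # rounding avoids float noise at the threshold
        if gain <= min_gain:
            reasons.append(f"{name}: gain {gain:+.2f} does not exceed {min_gain:.2f}")
    for name in strong:
        drop = round(old[name] - new[name], 4)
        if drop > max_regress:
            reasons.append(f"{name}: regressed by {drop:.2f}, more than {max_regress:.2f}")
    return reasons


# Old scores come from the clinic; the new scores are invented for illustration.
old_f1 = {"English": 0.96, "Swahili": 0.58, "Telugu": 0.49}
new_f1 = {"English": 0.95, "Swahili": 0.81, "Telugu": 0.55}
for reason in abandon_reasons(old_f1, new_f1, weak=["Swahili", "Telugu"], strong=["English"]):
    print(reason)
```

Fixing the thresholds before the retrain, rather than after seeing the numbers, keeps the decision from bending toward whatever result came back.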

What To Do Next

open Text Representations and Order — the systematic treatment of subword choices
open Text Generation and Language Models — the same tokenizer issues hit generation even harder
open Reliability Slices — the slice discipline this clinic relies on
rerun the product data through each tokenizer and plot subword-per-word histograms; one chart decides most of this clinic