Clinic 14
Tokenizer Choice
Three tokenizers are on the table — BPE, WordPiece, character — and a weak-slice report you did not expect. The macro F1 is fine; the error bucket is not. Pick a tokenizer and defend it.
Situation
Macro F1 Fine, One Slice Broken
A language-ID classifier hits 0.93 macro F1. The "code-mixed, rare languages" slice sits at 0.41. You have one tokenizer swap to try.
Your Job
Pick A Tokenizer
Choose BPE, WordPiece, or character, and name the inspection that decides whether the swap paid off.
Bad Habit To Avoid
Folklore Over Evidence
"Everyone uses BPE" is not a defense. Tokenizer choice is a property of the data, not of fashion.
Situation
You are building a text classifier for a multilingual product-support queue. The task is: given an incoming message, predict which of 40 languages (or "mixed") it is in, so the message is routed to the right specialist team.
Current state:
- training set: 800k messages labeled by language
- model: a small transformer, 4 layers, 256 hidden
- current tokenizer: a BPE tokenizer with 32k vocab, trained on a mix of web corpora
- macro F1 on held-out test set: 0.93 — solid headline number
- per-language slice analysis reveals three slices far below the headline number:
  - Swahili (low resource in pretraining mix): F1 0.58
  - Telugu (Indic script under-represented in vocab): F1 0.49
  - code-mixed (English + Hindi written in Latin script): F1 0.41
The product team cares about the three worst slices — routing a Swahili ticket to an English specialist is a visible failure. Macro F1 of 0.93 buys no goodwill if Telugu routing is broken.
You can do one thing before the next release: retrain with a different tokenizer. Three candidates are on the table.
Artifact Packet
| tokenizer | training corpus | vocab size | strengths | risks |
|---|---|---|---|---|
| BPE (current) | English-heavy web mix | 32k | fast, reasonable on Latin scripts | rare-script words fragment into single bytes |
| WordPiece | same mix, re-trained | 32k | similar size, slightly different split rule | same low-resource weakness as BPE |
| Character-level | none needed | ~1000 | zero vocab problems, language-agnostic | much longer sequences, more compute per sample |
| SentencePiece (BPE variant) retrained on multilingual balanced corpus | balanced 40-language | 64k | preserves rare-language morphology | training cost + bigger embedding table |
Weak-slice diagnostics on the current model:
- Swahili tokens average 6.4 subwords per word (English: 1.3)
- Telugu averages 11.7 subwords per word because the script is fragmenting into UTF-8 bytes
- code-mixed messages hit untokenizable sequences that collapse to `[UNK]` for ~8% of tokens
The failure pattern is not architectural — it is tokenizer coverage. The model is being asked to learn Telugu semantics from a stream where a single word is 12 fragmentary tokens.
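The two diagnostics above (subwords per word and `[UNK]` rate) are cheap to compute for any candidate tokenizer. The following is a minimal sketch, not the clinic's own runner: `coverage_stats` and `toy_tokenize` are hypothetical names, the toy tokenizer is a stand-in for whichever real tokenizer you are testing, and words are assumed to be whitespace-separated, which will not hold for every language in the queue.

```python
from typing import Callable, Dict, Iterable, List

UNK = "[UNK]"


def coverage_stats(
    messages: Iterable[str], tokenize: Callable[[str], List[str]]
) -> Dict[str, float]:
    """Mean subwords per whitespace-separated word, and [UNK] share of all tokens."""
    n_words = n_tokens = n_unk = 0
    for message in messages:
        for word in message.split():
            pieces = tokenize(word)
            n_words += 1
            n_tokens += len(pieces)
            n_unk += sum(1 for piece in pieces if piece == UNK)
    return {
        "subwords_per_word": n_tokens / n_words if n_words else 0.0,
        "unk_rate": n_unk / n_tokens if n_tokens else 0.0,
    }


def toy_tokenize(word: str) -> List[str]:
    """Hypothetical stand-in: known words stay whole, other alphabetic words
    split into characters, anything else becomes [UNK]."""
    vocab = {"the", "order", "is", "late"}
    if word in vocab:
        return [word]
    if word.isalpha():
        return list(word)
    return [UNK]


stats = coverage_stats(["the order is late", "abc #42"], toy_tokenize)
print(stats)
```

Run it once per slice and per tokenizer, passing each real tokenizer's word-to-pieces function in place of `toy_tokenize`, so the numbers are directly comparable.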
Decision Prompt
Write a six-sentence defense that answers:
- Which tokenizer do you choose, and for which slice?
- What does the subword-per-word distribution look like on your chosen slice, and why does that matter?
- What does the inference-cost change look like on typical inputs?
- What single inspection will tell you whether the swap helped the weak slices without breaking the strong ones?
- What would make you abandon the choice and try another?
- Which of the three strong slices (English, Spanish, French) is most at risk of regressing, and why?
Strong Reasoning Looks Like
- it picks the retrained multilingual SentencePiece / BPE, not the character model — the character model is a safe fallback but pays too much in sequence length
- it reports expected subword-per-word drops on the weak slices (Telugu fragments from ~12 to ~2–3; code-mixed `[UNK]` rate drops toward zero)
- it is honest about the cost: bigger vocab → bigger embedding table → slower training, slightly larger model
- it names the inspection (per-slice F1 AND subword-per-word AND `[UNK]` rate), not just the macro number
- it commits to regression-testing the strong slices (English, Spanish, French) to catch overfit to the weak ones
- it treats character-level as the safety-net choice for languages where even a balanced BPE still fragments (rare Indic scripts, Amharic, Burmese)
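The embedding-table cost named above can be put in rough numbers. This sketch assumes "32k" and "64k" mean 32,000 and 64,000 entries, counts only a single input embedding matrix, and uses the 256 hidden size from the model description; `embedding_params` is a hypothetical helper, and a real model may also carry a separate output projection.

```python
def embedding_params(vocab_size: int, hidden_size: int) -> int:
    """Parameters in one vocab_size x hidden_size embedding matrix."""
    return vocab_size * hidden_size


HIDDEN = 256  # hidden size from the clinic's model description

# Vocab sizes are assumptions about what "32k", "64k", and "~1000" mean exactly.
current_bpe = embedding_params(32_000, HIDDEN)
balanced_sp = embedding_params(64_000, HIDDEN)
char_level = embedding_params(1_000, HIDDEN)

for name, count in [
    ("BPE 32k", current_bpe),
    ("SentencePiece 64k", balanced_sp),
    ("character ~1000", char_level),
]:
    print(f"{name:>18}: {count:,} embedding parameters")
```

Under these assumptions the 64k table is twice the size of the current one (about 16.4M versus 8.2M parameters), while the character table is tiny. The character model pays in sequence length instead.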
Common Wrong Moves
- picking BPE because "it's what transformers use" — the current BPE is the problem
- jumping to character-level without measuring the sequence-length blowup on production inputs
- retraining a 64k SentencePiece tokenizer without checking whether your training data has enough balanced text for each script
- reporting only macro F1 after the swap and declaring success — the point was the weak slices, and macro F1 can rise while the target slice falls
- treating the `[UNK]` rate as acceptable on any slice: an unknown token is a silent zero-signal input
- swapping the tokenizer and forgetting that the embedding table has to be retrained from scratch (or carefully warm-started)
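The macro-F1 trap in the list above is easy to show with arithmetic, since macro F1 is the unweighted mean of per-class F1. The per-class numbers below are invented purely for illustration (only the 0.41 code-mixed starting point comes from the clinic), and `macro_f1` is a hypothetical helper.

```python
def macro_f1(per_class_f1: dict) -> float:
    """Macro F1 is the unweighted mean of the per-class F1 scores."""
    return sum(per_class_f1.values()) / len(per_class_f1)


# Illustrative numbers only: 40 language classes plus the code-mixed class.
before = {f"lang_{i}": 0.94 for i in range(40)}
before["code_mixed"] = 0.41

# After a hypothetical swap: every language gains half a point,
# while the slice we actually cared about gets worse.
after = {name: f1 + 0.005 for name, f1 in before.items()}
after["code_mixed"] = 0.36

print(f"macro F1 before: {macro_f1(before):.4f}")
print(f"macro F1 after:  {macro_f1(after):.4f}")
print(f"code-mixed: {before['code_mixed']:.2f} -> {after['code_mixed']:.2f}")
```

Forty small gains outweigh one five-point loss in the mean, so the headline number rises while the target slice falls. That is why the per-slice table has to be reported alongside it.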
Run The Clinic In Browser
Use the runner to measure subword-per-word and [UNK] rates on a tiny multilingual sample with each tokenizer.
Reference Reveal
Open only after you write the defense
The reference choice is **retrained SentencePiece-BPE on a balanced 40-language corpus, 64k vocab**.

The reasoning:

- the current failure is vocabulary coverage, not model capacity: the model never sees intact Telugu or Swahili morphemes, so no amount of training will recover them
- a character tokenizer would also fix the coverage problem but pays 4–10× in sequence length, which hurts latency and training cost; for a 4-layer transformer over support messages (median ~60 words), the length blowup is real
- a retrained WordPiece has the same weakness as the current BPE when the training corpus is English-heavy; the fix is not the splitting algorithm, it is the corpus balance
- **64k** instead of 32k vocab is the right trade: it captures rare-language morphology without making the embedding table gratuitous

Expected outcomes per slice (rough, task-dependent):

| slice | old F1 | expected F1 | why |
| --- | --- | --- | --- |
| English | 0.96 | 0.94–0.96 | small regression possible from split shift; unlikely if corpus is balanced |
| Swahili | 0.58 | 0.80+ | intact morphology instead of byte fragments |
| Telugu | 0.49 | 0.78+ | script now tokenized as Telugu, not UTF-8 bytes |
| code-mixed | 0.41 | 0.65+ | `[UNK]` rate drops; Hindi-in-Latin benefits from Hindi tokens |
| macro F1 | 0.93 | 0.93–0.95 | rises or holds: the weak slices pull up, the strong slices hold |

Inspections to run after the swap:

- subword-per-word distribution per language: the direct cause of the prior failure
- `[UNK]` rate per slice: should be near-zero on all
- per-slice F1 table (both weak and strong slices): macro alone is misleading
- wall-clock time per batch on the same hardware: guard against an unnoticed 2× slowdown

Abandon if: Swahili/Telugu fail to move by more than 10 F1 points (suggests the problem was not tokenizer coverage after all), or if English/Spanish regress by more than 2 F1 points without a compensating slice gain.
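The abandon rule can be written down as a small check so it is applied the same way after every retrain. This is one possible reading, not the clinic's official gate: `review_swap` is a hypothetical helper, the "after" scores are invented for the example, and whether a strong-slice regression counts as "compensated" is left as a human judgment, so the function only flags it.

```python
WEAK_MIN_GAIN = 0.10    # weak slices must gain MORE than 10 F1 points
STRONG_MAX_DROP = 0.02  # strong slices may lose at most 2 F1 points


def review_swap(old: dict, new: dict, weak: list, strong: list) -> list:
    """Return human-readable flags; an empty list means no abandon trigger fired."""
    flags = []
    for name in weak:
        gain = round(new[name] - old[name], 4)  # rounding avoids float noise at the threshold
        if gain <= WEAK_MIN_GAIN:
            flags.append(f"{name}: gained {gain:+.2f}, needs more than +{WEAK_MIN_GAIN:.2f}")
    for name in strong:
        drop = round(old[name] - new[name], 4)
        if drop > STRONG_MAX_DROP:
            flags.append(f"{name}: dropped {drop:.2f}, check for a compensating gain")
    return flags


# "old" values come from the expected-outcomes table; "new" values are made up.
old_f1 = {"English": 0.96, "Swahili": 0.58, "Telugu": 0.49}
new_f1 = {"English": 0.95, "Swahili": 0.80, "Telugu": 0.55}

flags = review_swap(old_f1, new_f1, weak=["Swahili", "Telugu"], strong=["English"])
print(flags)
```

In this invented run Swahili clears the bar and English holds, but Telugu moves only six points, which by the rule above suggests tokenizer coverage was not the whole problem for that slice.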
The practical lesson: **tokenizer choice is a property of your data, not of the model family**. A tokenizer trained on the wrong corpus poisons every downstream model choice.
What To Do Next
- open Text Representations and Order — the systematic treatment of subword choices
- open Text Generation and Language Models — the same tokenizer issues hit generation even harder
- open Reliability Slices — the slice discipline this clinic relies on
- rerun the product data through each tokenizer and plot subword-per-word histograms; one chart decides most of this clinic
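For the histogram step in the last item, a plotting library is optional; a text histogram is enough to see whether the distribution has a long tail. The sketch below is illustrative only: `subword_histogram`, `render`, and `chunk_tokenize` are hypothetical names, and the chunking tokenizer is a toy stand-in for a real one.

```python
from collections import Counter
from typing import Callable, Iterable, List


def subword_histogram(
    words: Iterable[str], tokenize: Callable[[str], List[str]]
) -> Counter:
    """Count how many words split into 1, 2, 3, ... subword pieces."""
    return Counter(len(tokenize(word)) for word in words)


def render(hist: Counter, width: int = 40) -> str:
    """Draw one text bar per bucket, scaled to the share of words in it."""
    total = sum(hist.values())
    lines = []
    for pieces in sorted(hist):
        bar = "#" * round(width * hist[pieces] / total)
        lines.append(f"{pieces:>3} | {bar} {hist[pieces]}")
    return "\n".join(lines)


def chunk_tokenize(word: str) -> List[str]:
    """Toy stand-in tokenizer: split every word into 3-character chunks."""
    return [word[i:i + 3] for i in range(0, len(word), 3)]


hist = subword_histogram(["cat", "router", "tokenizer", "a"], chunk_tokenize)
print(render(hist))
```

Produce one histogram per language and per tokenizer. A slice whose mass sits at 1 to 3 pieces per word is in good shape, while a slice with a heavy tail at 10 or more is the fragmentation pattern this clinic is about.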