🤖 AI Summary
Tokenizer effects on language model performance and behavior remain confounded with other architectural and training factors, hindering principled analysis. Method: We isolate the tokenizer's impact by training 14 models with identical architectures, datasets, and optimization configurations that differ only in their tokenizers (BPE, WordPiece, SentencePiece, etc.), and we introduce the first benchmark of tokenizer sensitivity to realistic text perturbations (whitespace, punctuation, spelling variants). Contribution/Results: Quantitative evaluation reveals 5–12% variance in downstream task performance across tokenizers, along with systematic behavioral differences in robustness, generalization, and computational efficiency rather than mere accuracy fluctuations. We identify fundamental trade-offs among these dimensions and localize the critical transformer layers and attention patterns governing tokenizer robustness, providing the first rigorous, decoupled characterization of tokenizer-induced effects in modern language models.
📝 Abstract
Tokenizers provide the fundamental basis through which text is represented and processed by language models (LMs). Despite the importance of tokenization, its role in LM performance and behavior is poorly understood due to the challenge of measuring the impact of tokenization in isolation. To address this challenge, we present TokSuite, a collection of models and a benchmark that supports research into tokenization's influence on LMs. Specifically, we train fourteen models that use different tokenizers but are otherwise identical, sharing the same architecture, dataset, training budget, and initialization. Additionally, we curate and release a new benchmark that specifically measures model performance under real-world perturbations that are likely to influence tokenization. Together, these resources allow the influence of a model's tokenizer to be robustly decoupled from other factors, supporting a series of novel findings that elucidate the respective benefits and shortcomings of a wide range of popular tokenizers.
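To make the benchmark's core idea concrete, the sketch below measures how often meaning-preserving perturbations (extra whitespace, dropped punctuation, spelling variants) change a tokenizer's output. This is an illustrative toy, not the TokSuite benchmark or its tokenizers: the two tokenizers and three perturbations here are hypothetical stand-ins, chosen only to show that different tokenization schemes react differently to the same surface changes.

```python
# Toy sensitivity probe: what fraction of perturbed inputs yield a different
# token sequence than the original text? (Illustrative only; the actual
# TokSuite benchmark and tokenizers are more extensive.)

def whitespace_tokenize(text):
    """Split on runs of whitespace; robust to extra spaces by construction."""
    return text.split()

def metaspace_tokenize(text):
    """SentencePiece-style: mark each space with '▁' so whitespace is
    preserved inside tokens, making the output whitespace-sensitive."""
    return text.replace(" ", " ▁").split(" ")

def perturb(text):
    """Meaning-preserving surface variants of the input (hypothetical set)."""
    return {
        "extra_space": text.replace(" ", "  "),          # doubled spaces
        "no_punct": text.replace(",", "").replace(".", ""),
        "spelling": text.replace("colour", "color"),     # UK -> US spelling
    }

def sensitivity(tokenize, text):
    """Fraction of perturbations that alter the token sequence."""
    base = tokenize(text)
    variants = perturb(text)
    return sum(tokenize(v) != base for v in variants.values()) / len(variants)

sentence = "The colour of the sky, at dusk."
print(f"{sensitivity(whitespace_tokenize, sentence):.2f}")  # prints 0.67
print(f"{sensitivity(metaspace_tokenize, sentence):.2f}")   # prints 1.00
```

The whitespace tokenizer absorbs doubled spaces but still changes under punctuation and spelling edits, while the metaspace tokenizer changes under all three perturbations; a benchmark in this spirit can rank real tokenizers by such robustness scores.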