TokSuite: Measuring the Impact of Tokenizer Choice on Language Model Behavior

📅 2025-12-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
Tokenizer effects on language model performance and behavior remain confounded with other architectural and training factors, hindering principled analysis. Method: We isolate tokenizer impact by training 14 models with identical architectures, datasets, and optimization configurations that differ only in their tokenizers (BPE, WordPiece, SentencePiece, etc.), and we introduce the first benchmark of tokenizer sensitivity to realistic text perturbations (whitespace, punctuation, spelling variants). Contribution/Results: Quantitative evaluation reveals up to 5–12% variance in downstream task performance across tokenizers, along with behavioral shifts that go beyond accuracy fluctuations: systematic differences in robustness, generalization, and computational efficiency. We identify fundamental trade-offs among these dimensions and localize the transformer layers and attention patterns that govern tokenizer robustness, providing the first rigorous, decoupled characterization of tokenizer-induced effects in modern language models.
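The perturbation benchmark described above can be illustrated with a minimal sketch. The code below is not the paper's benchmark: the perturbation functions and the `whitespace_tokenize` stand-in are hypothetical toy examples, and the sensitivity score (one minus token-sequence similarity) is one plausible way to quantify how much a tokenizer's output changes under a small surface edit.

```python
# Illustrative sketch (not the paper's benchmark): score how much a
# tokenizer's output changes when the input text is lightly perturbed.
# The perturbation categories mirror those named in the paper
# (whitespace, punctuation, spelling variants); the tokenizer is a toy.
import difflib

def whitespace_tokenize(text):
    """Toy stand-in for a trained tokenizer (BPE, WordPiece, ...)."""
    return text.split()

def perturb_whitespace(text):
    return text.replace(" ", "  ")          # doubled spaces

def perturb_punctuation(text):
    return text.replace(".", " .")          # detached periods

def perturb_spelling(text):
    return text.replace("color", "colour")  # regional spelling variant

def token_sensitivity(tokenize, original, perturbed):
    """1 - similarity of the two token sequences (0.0 = identical tokens)."""
    a, b = tokenize(original), tokenize(perturbed)
    return 1.0 - difflib.SequenceMatcher(None, a, b).ratio()

sentence = "the color of the sky changes at dusk."
for perturb in (perturb_whitespace, perturb_punctuation, perturb_spelling):
    score = token_sensitivity(whitespace_tokenize, sentence, perturb(sentence))
    print(perturb.__name__, round(score, 3))
```

Note that for this toy tokenizer the whitespace perturbation scores exactly 0.0, since `str.split` collapses runs of spaces; a learned subword tokenizer would generally not be so forgiving, which is precisely the kind of difference the benchmark is designed to surface.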

📝 Abstract
Tokenizers provide the fundamental basis through which text is represented and processed by language models (LMs). Despite the importance of tokenization, its role in LM performance and behavior is poorly understood, owing to the difficulty of measuring the impact of tokenization in isolation. To address this, we present TokSuite, a collection of models and a benchmark that support research into tokenization's influence on LMs. Specifically, we train fourteen models that use different tokenizers but are otherwise identical, sharing the same architecture, dataset, training budget, and initialization. Additionally, we curate and release a new benchmark that specifically measures model performance under real-world perturbations likely to influence tokenization. Together, these resources allow robust decoupling of the influence of a model's tokenizer, supporting a series of novel findings that elucidate the respective benefits and shortcomings of a wide range of popular tokenizers.
Problem

Research questions and friction points this paper is trying to address.

Measuring tokenizer impact on model performance
Decoupling tokenizer influence from other factors
Evaluating tokenizer effects under real-world perturbations
Innovation

Methods, ideas, or system contributions that make the work stand out.

Trains identical models with different tokenizers for comparison
Creates benchmark measuring tokenization impact under real-world perturbations
Enables robust decoupling of tokenizer influence on model behavior
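The controlled-comparison idea in the bullets above can be sketched with one common efficiency metric. The sketch below is hypothetical and not from the paper: it computes "fertility" (average tokens emitted per whitespace-delimited word), a standard way to compare how expensively different tokenizers segment the same corpus; the two toy tokenizers stand in for the trained tokenizers the models actually use.

```python
# Hypothetical sketch: compare tokenizers on the same corpus using
# fertility = total tokens / total words. Holding the corpus fixed and
# varying only the tokenizer mirrors the paper's controlled setup;
# both tokenizers here are toy stand-ins, not trained subword models.
def char_tokenize(text):
    """Character-level tokenizer (spaces dropped)."""
    return [c for c in text if not c.isspace()]

def word_tokenize(text):
    """Whitespace tokenizer: one token per word."""
    return text.split()

def fertility(tokenize, corpus):
    n_words = sum(len(line.split()) for line in corpus)
    n_tokens = sum(len(tokenize(line)) for line in corpus)
    return n_tokens / n_words

corpus = [
    "tokenizers shape what a model sees",
    "identical training, different segmentations",
]

print(fertility(word_tokenize, corpus))  # 1.0 by construction
print(fertility(char_tokenize, corpus))  # > 1.0: many tokens per word
```

Higher fertility means longer token sequences for the same text, which directly affects the computational efficiency dimension the summary mentions: a model paired with a high-fertility tokenizer spends more compute per sentence for the same training data.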