Broken Tokens? Your Language Model can Secretly Handle Non-Canonical Tokenizations

📅 2025-06-23

🤖 AI Summary

This work investigates how robust language models are to non-canonical tokenizations never seen during training, such as character-level splitting, random segmentation, and right-aligned digit grouping. Across 20 benchmarks, instruction-tuned models retain up to 93.4% of their original performance under a randomly sampled tokenization and 90.8% under character-level tokenization; stronger models tend to be more robust, and robustness drops as the tokenization moves farther from the canonical form. Some non-canonical schemes also improve performance: character-level segmentation raises string manipulation and code understanding by up to +14%, and right-aligned digit grouping raises large-number arithmetic by +33%. The robustness arises during instruction tuning: base and post-trained models both grasp the semantics of non-canonical tokenizations, but base models imitate the perceived misspellings and degenerate into nonsensical output, whereas post-trained models stay fluent. The results point to inference-time tokenization interventions as a way to boost performance.

📝 Abstract

Modern tokenizers employ deterministic algorithms to map text into a single "canonical" token sequence, yet the same string can be encoded as many non-canonical tokenizations using the tokenizer vocabulary. In this work, we investigate the robustness of LMs to text encoded with non-canonical tokenizations entirely unseen during training. Surprisingly, when evaluated across 20 benchmarks, we find that instruction-tuned models retain up to 93.4% of their original performance when given a randomly sampled tokenization, and 90.8% with character-level tokenization. We see that overall stronger models tend to be more robust, and robustness diminishes as the tokenization departs farther from the canonical form. Motivated by these results, we then identify settings where non-canonical tokenization schemes can *improve* performance, finding that character-level segmentation improves string manipulation and code understanding tasks by up to +14%, and right-aligned digit grouping enhances large-number arithmetic by +33%. Finally, we investigate the source of this robustness, finding that it arises in the instruction-tuning phase. We show that while both base and post-trained models grasp the semantics of non-canonical tokenizations (perceiving them as containing misspellings), base models try to mimic the imagined mistakes and degenerate into nonsensical output, while post-trained models are committed to fluent responses. Overall, our findings suggest that models are less tied to their tokenizer than previously believed, and demonstrate the promise of intervening on tokenization at inference time to boost performance.
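The three segmentation schemes named in the abstract can be illustrated on plain strings. The following is a minimal sketch, not the paper's procedure: the function names, the toy vocabulary, the group size of three, and the left-to-right sampling rule are all assumptions made for illustration, and a real tokenizer works on its own vocabulary and pre-tokenization rules.

```python
import random


def char_level(text):
    """Split text into single-character pieces."""
    return list(text)


def right_aligned_digits(number, group=3):
    """Group a digit string from the right, e.g. '1234567' -> ['1', '234', '567'].

    The group size of 3 is an assumption for this sketch.
    """
    head = len(number) % group  # leftover digits that form the leftmost piece
    pieces = [number[:head]] if head else []
    pieces += [number[i:i + group] for i in range(head, len(number), group)]
    return pieces


def random_segmentation(text, vocab, rng, max_len=8):
    """Sample one segmentation of text whose pieces are all in vocab.

    Assumes every single character of text is in vocab, so a valid
    piece always exists at each position. Picking a piece uniformly at
    each step does NOT sample uniformly over all segmentations.
    """
    pieces, i = [], 0
    while i < len(text):
        stop = min(len(text), i + max_len)
        options = [text[i:j] for j in range(i + 1, stop + 1) if text[i:j] in vocab]
        piece = rng.choice(options)
        pieces.append(piece)
        i += len(piece)
    return pieces


# Hypothetical toy vocabulary: all characters of the word plus a few merges.
toy_vocab = set("tokenization") | {"token", "ization", "to", "ken", "iz", "ation"}
sample = random_segmentation("tokenization", toy_vocab, random.Random(0))
```

Whatever pieces are sampled, concatenating them reproduces the original string, which is what makes each of these a non-canonical encoding of the same text rather than a different input.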

Problem

Research questions and friction points this paper is trying to address.

Assess LM robustness to unseen non-canonical tokenizations

Explore performance gains from alternative tokenization schemes

Investigate source of robustness in instruction-tuned models
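Figures such as "retains 93.4% of original performance" can be read as a ratio of scores. The sketch below assumes retention is the non-canonical score divided by the canonical score on the same benchmark, expressed as a percentage; the function name and the example scores are hypothetical, and the paper may aggregate across benchmarks differently.

```python
def retention(canonical_score, noncanonical_score):
    """Percentage of canonical-tokenization performance kept under a
    non-canonical tokenization (assumed definition for this sketch)."""
    if canonical_score <= 0:
        raise ValueError("canonical_score must be positive")
    return 100.0 * noncanonical_score / canonical_score


# Hypothetical scores, not taken from the paper.
kept = retention(canonical_score=0.5, noncanonical_score=0.25)
```

A value above 100 under this definition would correspond to the settings where a non-canonical scheme improves on the canonical one.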

Innovation

Methods, ideas, or system contributions that make the work stand out.

Evaluates LM robustness to non-canonical tokenizations

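Character-level segmentation improves string manipulation and code understanding by up to +14%; right-aligned digit grouping improves large-number arithmetic by +33%

Robustness to non-canonical tokenization arises in the instruction-tuning phase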

🔎 Similar Papers

Tokenization Matters! Degrading Large Language Models through Challenging Their Tokenization

2024-05-27 · arXiv.org · Citations: 11
