🤖 AI Summary
This work addresses the lack of robustness in language models that stems from their reliance on a single subword segmentation, which leaves them sensitive to semantically equivalent segmentation variants (termed homotokens) and prone to overfitting. The paper formally defines homotokens and leverages them as a semantics-preserving data augmentation strategy. It proposes a lightweight architecture that integrates an auxiliary causal encoder through a block-wise causal cross-attention mechanism, achieving tokenization invariance without altering the training objective or model interface. This approach mitigates overfitting and improves generalization in data-constrained pretraining. Notably, it yields the most pronounced improvements in multilingual fine-tuning scenarios where the original tokenizer exhibits high compression rates.
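To make the homotoken idea concrete, the sketch below enumerates every valid segmentation of a word under a toy subword vocabulary: each variant decodes back to the same surface string, so sampling any of them is a meaning-preserving augmentation. The vocabulary, function names, and the uniform sampling are illustrative assumptions, not the paper's actual implementation.

```python
import random

def segmentations(word, vocab):
    """Enumerate all ways to split `word` into pieces drawn from `vocab`.

    Every result decodes back to the same surface string, so each
    alternative is a homotoken variant of the canonical tokenization.
    """
    if not word:
        return [[]]
    out = []
    for i in range(1, len(word) + 1):
        piece = word[:i]
        if piece in vocab:
            for rest in segmentations(word[i:], vocab):
                out.append([piece] + rest)
    return out

def sample_homotoken(word, vocab, rng=random):
    """Sample one valid segmentation uniformly (a toy augmentation step)."""
    return rng.choice(segmentations(word, vocab))

# Toy subword vocabulary (hypothetical, not from the paper).
vocab = {"u", "n", "un", "b", "reak", "break", "a", "ble", "able"}
variants = segmentations("unbreakable", vocab)
```

Here the canonical pieces `["un", "break", "able"]` are just one of several segmentations that all spell out "unbreakable"; a real tokenizer would sample variants directly from its merge table or vocabulary rather than by exhaustive enumeration.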
📝 Abstract
Subword tokenization introduces a computational layer in language models where many distinct token sequences decode to the same surface form and preserve meaning, yet induce different internal computations. Despite this non-uniqueness, language models are typically trained using a single canonical longest-prefix tokenization. We formalize homotokens (alternative valid subword segmentations of the same lexical item) as a strictly meaning-preserving form of data augmentation. We introduce a lightweight training architecture that conditions canonical next-token prediction on sampled homotoken variants via an auxiliary causal encoder and block-causal cross-attention, without modifying the training objective or token interface. In data-constrained pretraining, homotoken augmentation consistently delays overfitting under repeated data exposure and improves generalization across diverse evaluation datasets. In multilingual fine-tuning, we find that the effectiveness of homotokens depends on tokenizer quality: gains are strongest when canonical tokens are highly compressed and diminish when the tokenizer already over-fragments the input. Overall, homotokens provide a simple and modular mechanism for inducing tokenization invariance in language models.
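The block-causal cross-attention constraint can be sketched as a boolean mask: a canonical-stream query may attend to a homotoken-stream key only if the key's block (e.g., the word it belongs to) is not in the query's future. The block-ID alignment and function name below are assumptions made for illustration; the paper's exact blocking scheme may differ.

```python
import numpy as np

def block_causal_mask(q_blocks, k_blocks):
    """Build a cross-attention mask between two token streams.

    q_blocks[i] / k_blocks[j] give the block (word) index of each
    position in the canonical and homotoken streams respectively.
    Entry (i, j) is True iff query i may attend to key j, i.e. the
    key's block does not lie strictly in the query's future.
    """
    q = np.asarray(q_blocks)[:, None]  # column of query block IDs
    k = np.asarray(k_blocks)[None, :]  # row of key block IDs
    return k <= q  # broadcasted comparison: True = attention allowed

# Canonical stream: three words tokenized into 4 tokens (blocks 0,1,1,2);
# homotoken stream: the same words resegmented into 5 tokens.
mask = block_causal_mask([0, 1, 1, 2], [0, 0, 1, 2, 2])
```

Because the mask depends only on block IDs, both streams can be resegmented freely per training step without changing the attention pattern's causal guarantee.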