🤖 AI Summary
This work quantifies the causal effect of including a specific subword (e.g., "hello") in a subword tokenizer's vocabulary on the probability the trained language model assigns to the corresponding character sequence—revealing a long-overlooked tokenization bias.
Method: We formalize this bias as a causal effect and propose an unbiased estimation framework based on regression discontinuity design (RDD), achieving local causal identifiability at the frequency-based vocabulary truncation threshold (K) used during tokenizer construction.
Contribution/Results: Through large-scale experiments across diverse model scales and tokenizer types (BPE, WordPiece, etc.), we demonstrate that subword inclusion is a critical causal design variable in language modeling: for small models, inclusion of a subword in the vocabulary can increase the probability assigned to its string by up to 17×. Our findings establish a causal lens and empirical foundation for tokenizer design, probability calibration, and trustworthy language modeling.
📝 Abstract
Modern language models are typically trained over subword sequences, but ultimately define probabilities over character-strings. Ideally, the choice of the tokeniser -- which maps character-strings to subwords -- should not affect the probability assigned to the underlying character-string; in practice, it does. We define this mismatch as tokenisation bias. In this work, we quantify one particular type of tokenisation bias: the effect of including a subword (e.g., $\langle$hello$\rangle$) or not in a tokeniser's vocabulary on the probability a trained model assigns to the corresponding characters (i.e., *"hello"*). Estimating this effect is challenging because each model is trained with only one tokeniser. We address this by framing tokenisation bias as a causal effect and estimating it using the regression discontinuity design. Specifically, we exploit the fact that tokenisation algorithms rank subwords and add the first $K$ to a tokeniser's vocabulary, where $K$ is an arbitrary cutoff point. As such, we can estimate a causal effect by comparing similar subwords around this cutoff. Experimentally, we find that tokenisation consistently affects models' outputs across scales, vocabularies, and tokenisers. Notably, a subword's presence in a small model's vocabulary may increase its characters' probability by up to 17 times, highlighting tokenisation as a key design choice in language modelling.
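The regression-discontinuity idea described above can be illustrated with a minimal sketch on synthetic data: subwords are ranked by frequency, the first $K$ enter the vocabulary, and the causal effect is read off as the jump in the outcome at the cutoff. All names, numbers, and the simulated "log-probability" outcome below are illustrative assumptions, not the paper's actual data or estimation code.

```python
# Minimal local-linear RDD sketch with synthetic data (illustrative only).
import numpy as np

rng = np.random.default_rng(0)
K = 500                       # hypothetical vocabulary cutoff
ranks = np.arange(1, 1001)    # subwords ranked by frequency (1 = most frequent)
in_vocab = ranks <= K         # "treatment": subword included in the vocabulary

# Synthetic outcome: log-probability of the subword's character string,
# with a smooth trend in rank plus a jump (the causal effect) at the cutoff.
true_effect = 1.5
logp = -0.002 * ranks + true_effect * in_vocab + rng.normal(0, 0.1, ranks.size)

# Local linear RDD: fit separate lines just below and just above the cutoff,
# then compare their predictions at the threshold rank K.
bw = 100                                    # bandwidth around the cutoff
left = (ranks > K - bw) & in_vocab          # included subwords near the cutoff
right = (ranks <= K + bw) & ~in_vocab       # excluded subwords near the cutoff
slope_l, icept_l = np.polyfit(ranks[left], logp[left], 1)
slope_r, icept_r = np.polyfit(ranks[right], logp[right], 1)
effect = (slope_l * K + icept_l) - (slope_r * K + icept_r)
print(f"estimated effect at the cutoff: {effect:.2f}")  # close to true_effect
```

Because treatment (vocabulary membership) is fully determined by rank relative to $K$, comparing the two local fits at the threshold recovers the discontinuity, which is the local causal effect the paper estimates.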