🤖 AI Summary
This work quantifies the causal effect of including a specific subword (e.g., "hello") in a subword tokenizer's vocabulary on the probability the trained language model assigns to the corresponding character sequence—revealing a long-overlooked tokenization bias.
Method: We formalize this bias as a causal effect and propose an unbiased estimation framework based on regression discontinuity design (RDD), achieving local causal identifiability at the frequency-based vocabulary truncation threshold (K) used during tokenizer construction.
Contribution/Results: Through large-scale experiments across diverse model scales and tokenizer types (BPE, WordPiece, etc.), we demonstrate that subword inclusion is a critical causal design variable in language modeling: for small models, inclusion of a subword in the vocabulary can increase the probability assigned to its string by up to 17×. Our findings establish a causal lens and empirical foundation for tokenizer design, probability calibration, and trustworthy language modeling.
📝 Abstract
Modern language models are typically trained over subword sequences, but ultimately define probabilities over character-strings. Ideally, the choice of the tokeniser -- which maps character-strings to subwords -- should not affect the probability assigned to the underlying character-string; in practice, it does. We define this mismatch as tokenisation bias. In this work, we quantify one particular type of tokenisation bias: the effect of including a subword (e.g., $\langle$hello$\rangle$) or not in a tokeniser's vocabulary on the probability a trained model assigns to the corresponding characters (i.e., *"hello"*). Estimating this effect is challenging because each model is trained with only one tokeniser. We address this by framing tokenisation bias as a causal effect and estimating it using the regression discontinuity design. Specifically, we exploit the fact that tokenisation algorithms rank subwords and add the first $K$ to a tokeniser's vocabulary, where $K$ is an arbitrary cutoff point. As such, we can estimate a causal effect by comparing similar subwords around this cutoff. Experimentally, we find that tokenisation consistently affects models' outputs across scales, vocabularies, and tokenisers. Notably, a subword's presence in a small model's vocabulary may increase its characters' probability by up to 17 times, highlighting tokenisation as a key design choice in language modelling.
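The regression-discontinuity idea described above can be illustrated with a minimal sketch on synthetic data: subwords are ranked by frequency, the first $K$ enter the vocabulary, and the causal effect is read off as the jump in the outcome at the cutoff. All names, numbers, and the simulated "log-probability" outcome below are illustrative assumptions, not the paper's actual data or estimation code.

```python
# Minimal local-linear RDD sketch with synthetic data (illustrative only).
import numpy as np

rng = np.random.default_rng(0)
K = 500                       # hypothetical vocabulary cutoff
ranks = np.arange(1, 1001)    # subwords ranked by frequency (1 = most frequent)
in_vocab = ranks <= K         # "treatment": subword included in the vocabulary

# Synthetic outcome: log-probability of the subword's character string,
# with a smooth trend in rank plus a jump (the causal effect) at the cutoff.
true_effect = 1.5
logp = -0.002 * ranks + true_effect * in_vocab + rng.normal(0, 0.1, ranks.size)

# Local linear RDD: fit separate lines just below and just above the cutoff,
# then compare their predictions at the threshold rank K.
bw = 100                                    # bandwidth around the cutoff
left = (ranks > K - bw) & in_vocab          # included subwords near the cutoff
right = (ranks <= K + bw) & ~in_vocab       # excluded subwords near the cutoff
slope_l, icept_l = np.polyfit(ranks[left], logp[left], 1)
slope_r, icept_r = np.polyfit(ranks[right], logp[right], 1)
effect = (slope_l * K + icept_l) - (slope_r * K + icept_r)
print(f"estimated effect at the cutoff: {effect:.2f}")  # close to true_effect
```

Because treatment (vocabulary membership) is fully determined by rank relative to $K$, comparing the two local fits at the threshold recovers the discontinuity, which is the local causal effect the paper estimates.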