🤖 AI Summary
This study addresses the "partial token" problem in language models, which arises when word boundaries in natural text misalign with tokenizer token boundaries, severely distorting next-token prediction probabilities. The authors systematically quantify the prevalence and severity of this issue in realistic settings, including Chinese, highly compounding languages, and code, and find that correct-continuation probabilities drop by roughly three orders of magnitude; the degradation does not diminish with scale and often worsens in larger models. Their methodology is grounded in authentic corpora: constructing semantically natural prompts that end at word boundaries but inside tokens, comparing the resulting probability distributions against token-aligned baselines, and evaluating inference-time token-alignment corrections. Experiments validate the effectiveness of recent exact alignment-based fixes, offering both empirical evidence and practical mitigations for this underappreciated yet critical issue.
📝 Abstract
Language models (LMs) are trained over sequences of tokens, whereas users interact with LMs via text. This mismatch gives rise to the partial token problem, which occurs when a user ends their prompt in the middle of the expected next token, leading to distorted next-token predictions. Although this issue has been studied using arbitrary character prefixes, its prevalence and severity in realistic prompts respecting word boundaries remain underexplored. In this work, we identify three domains where token and word boundaries often do not line up: languages that do not use whitespace, highly compounding languages, and code. In Chinese, for example, up to 25% of word boundaries do not line up with token boundaries, making even natural, word-complete prompts susceptible to this problem. We systematically construct semantically natural prompts ending with a partial token; in experiments, we find that they constitute a serious failure mode: frontier LMs consistently place three orders of magnitude less probability on the correct continuation compared to when the prompt is "backed off" to be token-aligned. This degradation does not diminish with scale and often worsens for larger models. Finally, we evaluate inference-time mitigations to the partial token problem and validate the effectiveness of recent exact solutions. Overall, we demonstrate the scale and severity of probability distortion caused by tokenization in realistic use cases, and provide practical recommendations for model inference providers.
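To make the "backed off" idea concrete, here is a minimal, self-contained sketch of the mitigation often called token healing: strip the prompt's trailing token when it could be the prefix of a longer token, and constrain the next generated token to start with the stripped text. The toy vocabulary, the greedy longest-match tokenizer, and the `heal_prompt` helper are illustrative assumptions for this sketch, not the paper's actual tokenizer or exact algorithm (which may back off further than one token).

```python
# Toy vocabulary: "hell" is itself a token but also a prefix of "hello",
# so a prompt ending in "hell" is a partial-token hazard.
VOCAB = ["hello", "hell", "he", "o", " world"]

def tokenize(text):
    """Greedy longest-match tokenization over the toy vocabulary."""
    tokens, i = [], 0
    while i < len(text):
        match = max((t for t in VOCAB if text.startswith(t, i)),
                    key=len, default=None)
        if match is None:
            raise ValueError(f"untokenizable at {i}: {text[i:]!r}")
        tokens.append(match)
        i += len(match)
    return tokens

def heal_prompt(text):
    """Back off the trailing token if it may sit inside a longer token.

    Returns (token-aligned prefix, allowed next tokens). The model is
    then sampled from `allowed` instead of the full vocabulary.
    """
    tail = tokenize(text)[-1]
    candidates = [t for t in VOCAB if t.startswith(tail) and t != tail]
    if not candidates:
        return text, VOCAB  # prompt already ends on a safe boundary
    return text[:-len(tail)], candidates

# "hell" is backed off entirely; only tokens starting with "hell" remain.
print(heal_prompt("hell"))   # ('', ['hello'])
# "hello" is not a prefix of any longer token, so nothing is stripped.
print(heal_prompt("hello"))  # ('hello', full vocabulary)
```

Without the back-off, the model would condition on the token `hell` and place very little probability on `o` as the next token, even though "hello" is the natural continuation; constraining generation after backing off recovers it.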