🤖 AI Summary
This work addresses tokenization-induced prediction bias in language models, introducing the novel concept of “tokenization bias” and the Byte-Token Representation Lemma to formalize how tokenization systematically distorts byte-level predictive distributions. To mitigate this, the authors propose the first provably bias-free zero-shot byte-level probability calibration framework, enabling standard tokenized language models to behave statistically identically to tokenization-free models—without retraining. The method integrates byte-level probabilistic modeling, theoretical analysis of tokenization mappings, and a zero-shot byte sampling algorithm, with extensions to fill-in-the-middle (FIM) tasks and multi-tokenizer model ensembles. Experiments demonstrate an ~18% improvement on FIM programming benchmarks—substantially outperforming token healing—and up to 3.7% gains across reasoning, knowledge, and programming tasks when applied to multi-tokenizer ensembles.
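To make the "tokenization bias" phenomenon concrete, here is a hedged toy illustration (the vocabulary, the longest-match rule, and the function below are illustrative assumptions, not the paper's actual setup): with a greedy longest-match tokenizer over the vocabulary {"a", "b", "ab"}, the token "a" is never immediately followed by the token "b" in tokenized text, so a model trained on tokens can assign near-zero probability to the byte "b" after the token "a" even when the byte-level probability of "b" is large.

```python
def longest_match_tokenize(text: str, vocab: set[str]) -> list[str]:
    """Greedy longest-match tokenizer over a toy vocabulary (illustrative only)."""
    tokens, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):          # try the longest span first
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            raise ValueError(f"no token covers position {i}")
    return tokens

vocab = {"a", "b", "ab"}
print(longest_match_tokenize("ab", vocab))    # ['ab']   -- never ['a', 'b']
print(longest_match_tokenize("aab", vocab))   # ['a', 'ab']
```

Because the token sequence ['a', 'b'] never occurs under this tokenizer, the token-level conditional distribution after "a" is systematically distorted relative to the byte-level one, which is the bias the paper formalizes.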
📝 Abstract
Tokenization is associated with many poorly understood shortcomings in language models (LMs), yet it remains an important component for long-sequence scaling. This work studies how tokenization affects model performance by analyzing and comparing the stochastic behavior of tokenized models with that of their byte-level, or token-free, counterparts. We discover that, even when the two models are statistically equivalent, their predictive distributions over the next byte can differ substantially, a phenomenon we term "tokenization bias". To fully characterize this phenomenon, we introduce the Byte-Token Representation Lemma, a framework that establishes a mapping between the learned token distribution and its equivalent byte-level distribution. Building on this result, we develop a next-byte sampling algorithm that eliminates tokenization bias without requiring further training or optimization. In other words, it enables zero-shot conversion of tokenized LMs into statistically equivalent token-free ones. We demonstrate its broad applicability with two use cases: fill-in-the-middle (FIM) tasks and model ensembles. In FIM tasks, where input prompts may terminate mid-token and thus produce out-of-distribution tokenizations, our method mitigates the resulting performance degradation and achieves an approximately 18% improvement on FIM coding benchmarks, consistently outperforming the standard token-healing fix. For model ensembles in which each model employs a distinct vocabulary, our approach enables seamless integration, resulting in improved performance (up to 3.7%) over individual models across various standard baselines in reasoning, knowledge, and coding.
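As a rough intuition for how a token-level distribution can be read at the byte level, the sketch below marginalizes next-token probabilities into a next-byte distribution by crediting each token's first byte. This is a minimal assumption-laden sketch, not the paper's algorithm; `next_token_probs` and `vocab` are hypothetical stand-ins for a real LM head and tokenizer vocabulary.

```python
from collections import defaultdict

def next_byte_distribution(next_token_probs: dict[int, float],
                           vocab: dict[int, bytes]) -> dict[int, float]:
    """Marginalize P(token | tokenized prefix) into P(next byte | prefix)."""
    byte_probs: dict[int, float] = defaultdict(float)
    for tok_id, p in next_token_probs.items():
        token_bytes = vocab.get(tok_id, b"")
        if token_bytes:                       # ignore empty/special tokens
            byte_probs[token_bytes[0]] += p   # credit the token's first byte
    total = sum(byte_probs.values())
    return {b: p / total for b, p in byte_probs.items()}

# Toy usage with vocabulary {0: b"a", 1: b"b", 2: b"ab"}.
probs = {0: 0.2, 1: 0.1, 2: 0.7}
vocab = {0: b"a", 1: b"b", 2: b"ab"}
print(next_byte_distribution(probs, vocab))   # ~ {97: 0.9, 98: 0.1}
```

Note that this naive marginalization is still biased: given an already-tokenized prompt, tokens that would straddle the prompt boundary are never proposed by the model, and accounting for exactly those cases is what the Byte-Token Representation Lemma and the paper's sampling algorithm address.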