🤖 AI Summary
Problem: Why do large language models (LLMs) rely on tokenization, and why does character-level modeling degrade the performance of Transformers?
Method: The authors train Transformers on data drawn from simple *k*-th-order Markov sources (*k* > 1) and analyze the end-to-end cross-entropy loss achieved with and without tokenization. The analysis relates the choice of tokenization to how accurately the model estimates the probabilities of sequences drawn from the source.
Contribution/Results: Without tokenization, Transformers are observed empirically to predict characters according to a unigram model, failing to capture the higher-order dependencies of the source (Makkuva et al., 2024). With an appropriate tokenization, the authors show that even the simplest unigram models over tokens learnt by Transformers model the probabilities of sequences from *k*-th-order Markov sources near-optimally, and experiments confirm that tokenization yields a small cross-entropy loss. The analysis offers a theoretical justification for the use of tokenization in practice.
📝 Abstract
While there has been a large body of research attempting to circumvent tokenization for language modeling (Clark et al., 2022; Xue et al., 2022), the current consensus is that it is a necessary initial step for designing state-of-the-art performant language models. In this paper, we investigate tokenization from a theoretical point of view by studying the behavior of transformers on simple data generating processes. When trained on data drawn from certain simple $k^{\text{th}}$-order Markov processes for $k>1$, transformers exhibit a surprising phenomenon: in the absence of tokenization, they empirically fail to learn the right distribution and predict characters according to a unigram model (Makkuva et al., 2024). With the addition of tokenization, however, we empirically observe that transformers break through this barrier and are able to model the probabilities of sequences drawn from the source near-optimally, achieving small cross-entropy loss. With this observation as a starting point, we study the end-to-end cross-entropy loss achieved by transformers with and without tokenization. With the appropriate tokenization, we show that even the simplest unigram models (over tokens) learnt by transformers are able to model the probability of sequences drawn from $k^{\text{th}}$-order Markov sources near-optimally. Our analysis provides a justification for the use of tokenization in practice through studying the behavior of transformers on Markovian data.
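To make the gap between character-level and token-level unigram models concrete, the following is a minimal sketch on a toy source, not the authors' setup. It assumes a hypothetical binary order-`k` source in which each bit copies the bit `k` positions back with probability `1 - p` and flips it otherwise, and it uses fixed-length blocks as a stand-in for a learned tokenizer dictionary (the abstract does not specify which tokenizers are analyzed). The function names and parameters are illustrative. The code computes exactly the best cross-entropy a unigram model over blocks can reach, in bits per character, and compares it with the entropy rate of the source.

```python
import itertools
import math


def binary_entropy(p):
    """Entropy in bits of a Bernoulli(p) variable."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)


def block_probability(block, k, p):
    """Stationary probability of a bit block under the toy order-k source.

    Assumed toy source: the first k bits of a block are uniform (the
    stationary law, by symmetry), and every later bit copies the bit
    k positions back with probability 1 - p, otherwise flips it.
    """
    prob = 1.0
    for i, bit in enumerate(block):
        if i < k:
            prob *= 0.5
        else:
            prob *= (1 - p) if bit == block[i - k] else p
    return prob


def unigram_loss_per_char(block_len, k, p):
    """Best cross-entropy (bits per character) of a unigram model over blocks.

    The optimal unigram model matches the true block distribution, so its
    loss per block is the block entropy; dividing by the block length
    gives the loss per character.
    """
    entropy = 0.0
    for block in itertools.product((0, 1), repeat=block_len):
        q = block_probability(block, k, p)
        if q > 0:
            entropy -= q * math.log2(q)
    return entropy / block_len


if __name__ == "__main__":
    k, p = 2, 0.1  # illustrative values, not taken from the paper
    print("entropy rate:", binary_entropy(p))
    for block_len in (1, 2, 4, 8, 12):
        print(block_len, unigram_loss_per_char(block_len, k, p))
```

With `block_len = 1` the unigram model sees only single characters and cannot do better than 1 bit per character on this source, whereas the entropy rate is the binary entropy of `p`. As the blocks grow longer, the unigram loss per character moves toward the entropy rate, which mirrors the abstract's claim that unigram models over well-chosen tokens can be near-optimal.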