🤖 AI Summary
This work challenges the conventional assumption in large language models (LLMs) that the probability of a text string equals the probability of its canonical tokenization, revealing that non-canonical tokenizations of the same string encode underutilized semantic and structural signals. We first prove that, under autoregressive LLMs, both finding the most probable tokenization and computing marginal probabilities across all tokenizations are NP-hard. To address this, we propose an efficient approximation algorithm based on dynamic programming with aggressive pruning, compatible with diverse architectures including Transformers and State Space Models (SSMs). Empirically, aggregating marginal probabilities over non-canonical tokenizations—without modifying model parameters or training—yields consistent performance gains across multiple LLM evaluation benchmarks (e.g., LM Evaluation Harness and HELM subsets). These results demonstrate that the tokenization space harbors exploitable latent probabilistic structure, offering a novel, architecture-agnostic avenue for improving LLM inference.
📝 Abstract
Large Language Models (LLMs) are typically shipped with tokenizers that *deterministically* encode text into so-called *canonical* token sequences, to which the LLMs assign probability values. One common assumption is that the probability of a piece of text is the probability of its canonical token sequence. However, the tokenization of a string is not unique: e.g., the Llama2 tokenizer encodes `Tokens` as `[Tok,ens]`, but `[Tok,en,s]` also represents the same text. In this paper, we study non-canonical tokenizations. We prove that, given a string, it is computationally hard to find the most likely tokenization for an autoregressive LLM, as well as to compute the marginal probability over all possible tokenizations. We then show how the marginal is, in most cases, indistinguishable from the canonical probability. Surprisingly, we then empirically demonstrate the existence of a significant amount of signal hidden within tokenization space. Notably, by simply aggregating the probabilities of non-canonical tokenizations, we achieve improvements across a range of LLM evaluation benchmarks for a variety of architectures, including transformers and state space models.
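To make the distinction between the canonical probability and the marginal concrete, here is a minimal brute-force sketch in Python. It is not the paper's method: the four-token vocabulary, the `toy_prob` table, and the function names `tokenizations` and `marginal` are all hypothetical stand-ins chosen for illustration, and the probabilities are made up rather than taken from any model. The sketch enumerates every segmentation of a string over the vocabulary and sums the sequence probabilities.

```python
# Toy vocabulary: an assumption for illustration, not the Llama2 vocabulary.
VOCAB = {"Tok", "ens", "en", "s"}


def tokenizations(text, vocab):
    """Return every way to split `text` into a sequence of tokens from `vocab`."""
    if not text:
        return [[]]  # exactly one way to tokenize the empty string
    results = []
    for end in range(1, len(text) + 1):
        prefix = text[:end]
        if prefix in vocab:
            # Extend each tokenization of the remainder with this prefix.
            for rest in tokenizations(text[end:], vocab):
                results.append([prefix] + rest)
    return results


def toy_prob(tokens):
    """Made-up sequence probabilities standing in for an autoregressive LLM."""
    table = {
        ("Tok", "ens"): 0.5,       # the canonical tokenization in the example
        ("Tok", "en", "s"): 0.25,  # a non-canonical tokenization
    }
    return table.get(tuple(tokens), 0.0)


def marginal(text, vocab, seq_prob):
    """Sum the probability of every tokenization that decodes to `text`."""
    return sum(seq_prob(t) for t in tokenizations(text, vocab))


if __name__ == "__main__":
    for t in tokenizations("Tokens", VOCAB):
        print(t, toy_prob(t))
    print("marginal:", marginal("Tokens", VOCAB, toy_prob))
```

In this toy setting the canonical sequence carries probability 0.5 while the marginal is 0.75, so the two quantities differ. Enumeration like this is only practical for very short strings and tiny vocabularies, since the number of segmentations can grow rapidly with string length.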