Where is the signal in tokenization space?

📅 2024-08-16

🏛️ Conference on Empirical Methods in Natural Language Processing

📈 Citations: 6

✨ Influential: 0

career value

178K/year

🤖 AI Summary

This work challenges the conventional assumption in large language models (LLMs) that the probability of a text string equals the probability of its canonical tokenization, revealing that non-canonical tokenizations of the same string encode underutilized semantic and structural signals. We first prove that, under autoregressive LLMs, both finding the most probable tokenization and computing marginal probabilities across all tokenizations are NP-hard. To address this, we propose an efficient approximation algorithm based on dynamic programming with aggressive pruning, compatible with diverse architectures including Transformers and State Space Models (SSMs). Empirically, aggregating marginal probabilities over non-canonical tokenizations—without modifying model parameters or training—yields consistent performance gains across multiple LLM evaluation benchmarks (e.g., LM Evaluation Harness and HELM subsets). These results demonstrate that the tokenization space harbors exploitable latent probabilistic structure, offering a novel, architecture-agnostic avenue for improving LLM inference.

Technology Category

Application Category

📝 Abstract

Large Language Models (LLMs) are typically shipped with tokenizers that *deterministically* encode text into so-called *canonical* token sequences, to which the LLMs assign probability values.One common assumption is that the probability of a piece of text is the probability of its canonical token sequence.However, the tokenization of a string is not unique: e.g., the Llama2 tokenizer encodes ‘Tokens‘ as ‘[Tok,ens]‘, but ‘[Tok,en,s]‘ also represents the same text.In this paper, we study non-canonical tokenizations.We prove that, given a string, it is computationally hard to find the most likely tokenization for an autoregressive LLM, as well as to compute the marginal probability over all possible tokenizations.We then show how the marginal is, in most cases, indistinguishable from the canonical probability.Surprisingly, we then empirically demonstrate the existence of a significant amount of signal hidden within tokenization space.Notably, by simply aggregating the probabilities of non-canonical tokenizations, we achieve improvements across a range of LLM evaluation benchmarks for a variety of architectures, including transformers and state space models.

Problem

Research questions and friction points this paper is trying to address.

Finding the most likely tokenization for autoregressive LLMs is computationally hard

Marginal probability over all tokenizations often matches canonical probability

Aggregating non-canonical tokenization probabilities improves LLM benchmark performance

Innovation

Methods, ideas, or system contributions that make the work stand out.

Study non-canonical tokenizations in LLMs

Compute marginal probability over tokenizations

Aggregate probabilities to improve benchmarks

🔎 Similar Papers

Emergence of a High-Dimensional Abstraction Phase in Language Transformers