🤖 AI Summary
Existing BPE-based language models assign nonzero probability to exponentially many "noncanonical" token strings, i.e., token sequences that decode to valid character strings but can never be produced by the deterministic tokenizer, so they never appear in any training corpus. This work formulates and implements **token-level canonicality constraints** for language models, via two approaches: *canonicality by conditioning* (test-time reweighting, no additional training) and *canonicality by construction* (a model parameterization that guarantees canonical outputs but requires training). By eliminating this misallocated probability mass, the authors demonstrate consistent improvements in held-out log-likelihood across several models and corpora, showing that canonicality constraints are both theoretically well-founded and practically effective without an architectural overhaul.
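The conditioning approach can be made concrete with a small sketch. This is an illustrative stand-in, not the paper's implementation: `renormalize_canonical` and the toy distribution below are hypothetical, and a real canonicality test would re-encode the decoded string with the actual tokenizer.

```python
# Hypothetical sketch of canonicality by conditioning: zero out the
# probability mass assigned to noncanonical token strings, then
# renormalize over the canonical ones.

def renormalize_canonical(seq_probs, is_canonical):
    """Condition a distribution over token strings on canonicality."""
    mass = sum(p for seq, p in seq_probs.items() if is_canonical(seq))
    return {
        seq: (p / mass if is_canonical(seq) else 0.0)
        for seq, p in seq_probs.items()
    }

# Toy distribution: ("ab",) and ("a", "b") decode to the same string,
# but only ("ab",) is what the tokenizer would produce.
probs = {("ab",): 0.6, ("a", "b"): 0.2, ("cd",): 0.2}
canonical = lambda seq: seq != ("a", "b")  # toy canonicality predicate
reweighted = renormalize_canonical(probs, canonical)
```

After reweighting, the noncanonical encoding carries zero mass and the remaining mass is rescaled to sum to one, which is exactly why held-out (canonical) data becomes more likely under the conditioned model.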
📝 Abstract
Modern language models represent probability distributions over character strings as distributions over (shorter) token strings derived via a deterministic tokenizer, such as byte-pair encoding. While this approach is highly effective at scaling up language models to large corpora, its current incarnations have a concerning property: the model assigns nonzero probability mass to an exponential number of *noncanonical* token encodings of each character string -- these are token strings that decode to valid character strings but are impossible under the deterministic tokenizer (i.e., they will never be seen in any training corpus, no matter how large). This misallocation is both erroneous, as noncanonical strings never appear in training data, and wasteful, diverting probability mass away from plausible outputs. These are avoidable mistakes! In this work, we propose methods to enforce canonicality in token-level language models, ensuring that only canonical token strings are assigned positive probability. We present two approaches: (1) canonicality by conditioning, leveraging test-time inference strategies without additional training, and (2) canonicality by construction, a model parameterization that guarantees canonical outputs but requires training. We demonstrate that fixing canonicality mistakes improves the likelihood of held-out data for several models and corpora.
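To see why noncanonical encodings exist at all, consider a toy greedy BPE encoder (the merge rules below are assumptions for illustration, not any real tokenizer's vocabulary). A token string is canonical exactly when it matches what the deterministic encoder produces for its own decoding:

```python
# Minimal toy BPE encoder. Merge rules are applied in priority order,
# so encoding is deterministic: each string has exactly one canonical
# token encoding, while other decodable encodings are noncanonical.

MERGES = [("a", "b"), ("ab", "c")]  # assumed merge rules, highest priority first

def encode(text):
    toks = list(text)
    for left, right in MERGES:                  # deterministic merge order
        i = 0
        while i < len(toks) - 1:
            if (toks[i], toks[i + 1]) == (left, right):
                toks[i:i + 2] = [left + right]  # merge the adjacent pair
            else:
                i += 1
    return tuple(toks)

def is_canonical(token_string):
    # Canonical iff re-encoding the decoded string reproduces it.
    return encode("".join(token_string)) == token_string

print(encode("abc"))                 # -> ('abc',)
print(is_canonical(("abc",)))        # -> True
print(is_canonical(("a", "b", "c"))) # -> False: decodes to "abc" but
                                     #    the tokenizer never emits it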