Pre-trained Models Perform the Best When Token Distributions Follow Zipf's Law

📅 2025-07-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
In NLP and other sequence-modeling domains, vocabulary size selection lacks universal, principled guidelines. This paper proposes the Zipfian Alignment principle: using Zipf's law to model token frequency distributions, it defines the optimal tokenization granularity as the vocabulary size that maximizes the alignment between observed token frequencies and the ideal Zipf distribution. To the authors' knowledge, this is the first work to elevate Zipf's law into a cross-domain vocabulary-sizing principle, applicable not only in NLP but also in genomics and chemical sequence modeling. Experiments across diverse downstream tasks demonstrate that Zipf-aligned vocabularies consistently outperform heuristic or fixed-size baselines, yielding average improvements of 1.8–3.2 points while reducing computational overhead. The core contribution is an interpretable, statistically grounded link between token frequency distribution characteristics and model performance: the first vocabulary-optimization paradigm for sequence modeling rooted in empirical linguistic statistics.

📝 Abstract
Tokenization is a fundamental step in natural language processing (NLP) and other sequence modeling domains, where the choice of vocabulary size significantly impacts model performance. Despite its importance, selecting an optimal vocabulary size remains underexplored, typically relying on heuristics or dataset-specific choices. In this work, we propose a principled method for determining the vocabulary size by analyzing token frequency distributions through Zipf's law. We show that downstream task performance correlates with how closely token distributions follow power-law behavior, and that aligning with Zipfian scaling improves both model efficiency and effectiveness. Extensive experiments across NLP, genomics, and chemistry demonstrate that models consistently achieve peak performance when the token distribution closely adheres to Zipf's law, establishing Zipfian alignment as a robust and generalizable criterion for vocabulary size selection.
Problem

Research questions and friction points this paper is trying to address.

Selecting an optimal vocabulary size for tokenization is underexplored and typically heuristic
Should token frequency distributions align with Zipf's law?
Does Zipfian scaling improve model efficiency and effectiveness?
Innovation

Methods, ideas, or system contributions that make the work stand out.

Align token distributions with Zipf's law
Use Zipfian scaling for vocabulary selection
Correlate performance with power-law behavior
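The selection criterion the paper describes can be sketched as follows. This is a hypothetical illustration, not the authors' released code: for each candidate vocabulary size, tokenize a corpus, sort token counts by rank, and score how closely the rank-frequency curve follows a power law (a straight line in log-log space, measured here by the R² of a least-squares fit); the vocabulary size with the best fit wins. The function names `zipf_fit_r2` and `pick_vocab_size` are assumptions made for this sketch.

```python
# Hypothetical sketch of Zipfian-alignment vocabulary selection:
# score each candidate vocabulary by how well its token counts
# follow Zipf's law (linearity in log-log space), then pick the best.
import math

def zipf_fit_r2(frequencies):
    """R^2 of a least-squares line fit to (log rank, log frequency).

    `frequencies` is an iterable of raw token counts; under Zipf's law
    the ranked counts fall on a straight line in log-log space.
    """
    freqs = sorted((f for f in frequencies if f > 0), reverse=True)
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot

def pick_vocab_size(freq_by_vocab_size):
    """Given {vocab_size: token counts}, return the size whose
    distribution is closest to ideal Zipfian scaling (highest R^2)."""
    return max(freq_by_vocab_size,
               key=lambda v: zipf_fit_r2(freq_by_vocab_size[v]))
```

In practice the candidate counts would come from training (e.g. BPE) tokenizers at several vocabulary sizes on the same corpus; the paper's exact alignment metric may differ from the simple log-log R² used here.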
Yanjin He
School of Mathematical Sciences, Peking University
Qingkai Zeng
Assistant Professor, Nankai University; University of Notre Dame
data mining · natural language processing · knowledge graph · large language models
Meng Jiang
Department of Computer Science and Engineering, University of Notre Dame