Pre-trained Models Perform the Best When Token Distributions Follow Zipf's Law

📅 2025-07-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
In NLP and other sequence-modeling domains, vocabulary size selection lacks universal, principled guidelines. This paper proposes the Zipfian Alignment principle: using Zipf's law to model token frequency distributions, it defines the optimal tokenization granularity as the vocabulary size that maximizes the alignment between observed token frequencies and the ideal Zipf distribution. To the authors' knowledge, this is the first work to elevate Zipf's law into a cross-domain vocabulary-sizing principle, applicable not only in NLP but also in genomics and chemical sequence modeling. Experiments across diverse downstream tasks demonstrate that Zipf-aligned vocabularies consistently outperform heuristic or fixed-size baselines, yielding average improvements of 1.8–3.2 points while reducing computational overhead. The core contribution is an interpretable, statistically grounded link between token frequency distribution characteristics and model performance: the first vocabulary-optimization paradigm for sequence modeling rooted in empirical linguistic statistics.

📝 Abstract
Tokenization is a fundamental step in natural language processing (NLP) and other sequence modeling domains, where the choice of vocabulary size significantly impacts model performance. Despite its importance, selecting an optimal vocabulary size remains underexplored, typically relying on heuristics or dataset-specific choices. In this work, we propose a principled method for determining the vocabulary size by analyzing token frequency distributions through Zipf's law. We show that downstream task performance correlates with how closely token distributions follow power-law behavior, and that aligning with Zipfian scaling improves both model efficiency and effectiveness. Extensive experiments across NLP, genomics, and chemistry demonstrate that models consistently achieve peak performance when the token distribution closely adheres to Zipf's law, establishing Zipfian alignment as a robust and generalizable criterion for vocabulary size selection.
Problem

Research questions and friction points this paper is trying to address.

Selecting an optimal vocabulary size for tokenization is underexplored and typically heuristic
Should token frequency distributions align with Zipf's law?
Does Zipfian scaling improve model efficiency and effectiveness?
Innovation

Methods, ideas, or system contributions that make the work stand out.

Align token distributions with Zipf's law
Use Zipfian scaling for vocabulary selection
Correlate performance with power-law behavior
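The selection criterion the paper describes can be sketched as follows. This is a hypothetical illustration, not the authors' released code: for each candidate vocabulary size, tokenize a corpus, sort token counts by rank, and score how closely the rank-frequency curve follows a power law (a straight line in log-log space, measured here by the R² of a least-squares fit); the vocabulary size with the best fit wins. The function names `zipf_fit_r2` and `pick_vocab_size` are assumptions made for this sketch.

```python
# Hypothetical sketch of Zipfian-alignment vocabulary selection:
# score each candidate vocabulary by how well its token counts
# follow Zipf's law (linearity in log-log space), then pick the best.
import math

def zipf_fit_r2(frequencies):
    """R^2 of a least-squares line fit to (log rank, log frequency).

    `frequencies` is an iterable of raw token counts; under Zipf's law
    the ranked counts fall on a straight line in log-log space.
    """
    freqs = sorted((f for f in frequencies if f > 0), reverse=True)
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot

def pick_vocab_size(freq_by_vocab_size):
    """Given {vocab_size: token counts}, return the size whose
    distribution is closest to ideal Zipfian scaling (highest R^2)."""
    return max(freq_by_vocab_size,
               key=lambda v: zipf_fit_r2(freq_by_vocab_size[v]))
```

In practice the candidate counts would come from training (e.g. BPE) tokenizers at several vocabulary sizes on the same corpus; the paper's exact alignment metric may differ from the simple log-log R² used here.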
Yanjin He
School of Mathematical Sciences, Peking University
Qingkai Zeng
Assistant Professor, Nankai University; University of Notre Dame
data mining · natural language processing · knowledge graph · large language models
Meng Jiang
Department of Computer Science and Engineering, University of Notre Dame