Lost in Sampling: Assessing Lexical Reachability in LLMs via the Word Coverage Score (WCS)

📅 2026-05-26
🤖 AI Summary
Large language models often produce repetitive, homogeneous text because common decoding strategies suppress low-frequency words that fit the context. This work proposes the Word Coverage Score (WCS), a diagnostic metric that quantifies how much of the lexical diversity found in human text is pruned by the decoder. The authors audit open-weight models on human-authored corpus fragments and measure which human-chosen words survive sampling filters such as Top-p, Top-k, and Min-p. The results indicate that industry-standard sampling defaults significantly filter out low-frequency, high-information words. The WCS offers a framework for tuning the trade-off between textual coherence and lexical richness, and for mitigating diversity loss in generated text.
📝 Abstract
Modern Large Language Models (LLMs) are often criticized for producing repetitive and homogeneous text, despite possessing vast latent vocabularies. While previous research has focused on model knowledge and training data, we investigate the role of decoding mechanics in suppressing linguistic diversity. We introduce the Word Coverage Score (WCS), a metric that quantifies the extent to which contextually appropriate human vocabulary is mathematically pruned by standard sampling filters (e.g., Top-$p$, Top-$k$, and Min-$p$). Rather than assessing static knowledge, the WCS measures the lexical survival rate of low-frequency, high-information human words as a function of sampling parameters. By auditing open-weight models on human-authored corpus fragments, we identify which logical lexical choices are rendered unreachable by the decoder, even when they reside within the probability space. Our results provide quantitative evidence that industry-standard sampling defaults act as unintended censorship mechanisms, smoothing the unique textures of human expression into a homogenized discourse. The WCS offers a rigorous framework for optimizing the trade-off between text coherence and lexical richness, providing a diagnostic tool for preserving the diversity of human language in generative models.
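The abstract describes the WCS as a survival rate of human words under sampling filters, but this listing does not give the exact formula. The sketch below is therefore only an illustration of the general idea under stated assumptions: it works on tokens rather than words, ignores temperature, takes next-token distributions as plain dictionaries, and counts how often the token a human actually wrote stays inside the candidate set after Top-k, Top-p, or Min-p filtering. All function names and the toy data are hypothetical, not the authors' implementation.

```python
from typing import Callable, Dict, List, Set

Dist = Dict[str, float]  # token -> probability (assumed to sum to 1)


def top_k_survivors(dist: Dist, k: int) -> Set[str]:
    """Top-k keeps the k most probable tokens."""
    ranked = sorted(dist, key=dist.get, reverse=True)
    return set(ranked[:k])


def top_p_survivors(dist: Dist, p: float) -> Set[str]:
    """Top-p keeps the shortest ranked prefix whose cumulative mass reaches p."""
    kept: Set[str] = set()
    mass = 0.0
    for tok in sorted(dist, key=dist.get, reverse=True):
        kept.add(tok)
        mass += dist[tok]
        if mass >= p:
            break
    return kept


def min_p_survivors(dist: Dist, min_p: float) -> Set[str]:
    """Min-p keeps tokens with probability >= min_p * (top probability)."""
    threshold = min_p * max(dist.values())
    return {tok for tok, prob in dist.items() if prob >= threshold}


def survival_rate(
    dists: List[Dist],
    human_tokens: List[str],
    survivors_fn: Callable[[Dist], Set[str]],
) -> float:
    """Fraction of positions where the human-written token survives the filter.

    Assumption: this simple ratio stands in for the paper's WCS, whose exact
    definition is not reproduced in this listing.
    """
    hits = sum(tok in survivors_fn(d) for d, tok in zip(dists, human_tokens))
    return hits / len(human_tokens)


# Toy data (invented for illustration): one distribution per position,
# paired with the token a human author actually used there.
TOY_DISTS: List[Dist] = [
    {"the": 0.6, "a": 0.3, "yon": 0.1},
    {"cat": 0.5, "dog": 0.3, "ocelot": 0.15, "quokka": 0.05},
]
TOY_HUMAN: List[str] = ["the", "ocelot"]

if __name__ == "__main__":
    for name, fn in [
        ("top-k, k=2", lambda d: top_k_survivors(d, 2)),
        ("top-p, p=0.7", lambda d: top_p_survivors(d, 0.7)),
        ("min-p, 0.1", lambda d: min_p_survivors(d, 0.1)),
    ]:
        print(name, survival_rate(TOY_DISTS, TOY_HUMAN, fn))
```

In the toy data, the rarer human choice "ocelot" is cut by tight Top-k and Top-p settings but survives a loose Min-p threshold, which is the kind of parameter-dependent reachability the abstract refers to. In a real audit the distributions would come from a model's next-token probabilities over a human-authored text.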
Problem

Research questions and friction points this paper is trying to address.

lexical diversity
sampling mechanisms
language models
word coverage
decoding bias
Innovation

Methods, ideas, or system contributions that make the work stand out.

Word Coverage Score
lexical diversity
sampling filters
decoding mechanics
language homogenization
Authors
Samer Awad
Information and Processing Telecommunications Center, Universidad Politécnica de Madrid, Madrid, Spain
Javier Conde
Information and Processing Telecommunications Center, Universidad Politécnica de Madrid, Madrid, Spain
Carlos Arriaga
Information and Processing Telecommunications Center, Universidad Politécnica de Madrid, Madrid, Spain
Tairan Fu
Politecnico di Milano, Milano, Italy
Javier Coronado-Blázquez
Telefónica Tech AI & Data (PhD in Theoretical Physics)
AI, LLMs, Dark Matter, Gamma Rays, N-Body Cosmological Simulations
Pedro Reviriego
Information and Processing Telecommunications Center, Universidad Politécnica de Madrid, Madrid, Spain