🤖 AI Summary
Large language models often suffer from diversity collapse during generation: they produce only a narrow range of plausible outputs, which limits their utility in creative and scientific exploration. This work proposes a validity-diversity framework that attributes the issue to how the model allocates probability between valid and invalid continuation tokens during decoding. The problem is formalized as two complementary forms of miscalibration: order miscalibration, where valid tokens are not reliably ranked above invalid ones, and shape miscalibration, where probability mass is overly concentrated on a few valid continuations. Using controlled diagnostic tasks, including tasks with exactly known valid sets and oracle cutoff baselines, the study evaluates 14 language models spanning multiple families and scales. It finds that diversity collapse is not merely a limitation of particular sampling heuristics but a consequence of miscalibration in the model's own distribution, with local failures compounding across decoding steps to cause substantial sequence-level diversity loss.
📝 Abstract
Diversity is essential for language-model applications ranging from creative generation to scientific discovery, yet modern LLMs often collapse into a narrow subset of plausible outputs. While prior work has developed benchmarks for measuring this lack of diversity, less is known about how the step-by-step probability distributions at inference time cause the problem. We introduce a validity-diversity framework that attributes diversity collapse to how an LLM allocates probability mass across valid and invalid continuations during decoding. This framework decomposes the bottleneck into two complementary forms of miscalibration. First, order miscalibration: valid tokens are not reliably ranked above invalid tokens, so rank-based cutoff rules must trade off between recovering valid continuations and admitting invalid ones. Second, shape miscalibration: probability mass is overly concentrated on a few valid continuations, while the heavy tail mixes valid and invalid tokens, so maintaining high validity limits diversity. We formalize both mechanisms and show that local failures compound across decoding steps, producing strong sequence-level losses in diversity. Empirically, we develop controlled diagnostics for probing these bottlenecks, including tasks with exactly known valid sets and oracle cutoff baselines. Across 14 language models spanning multiple families and scales, we find that diversity collapse is not merely a limitation of particular sampling heuristics, but a consequence of order and shape miscalibration in the LLM distribution.
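To make the two mechanisms concrete, the following is a minimal sketch on a toy next-token distribution, not the paper's actual diagnostics. The token names, probabilities, and the helper functions `top_k_tradeoff`, `oracle_entropy`, and `sequence_validity` are all hypothetical. The compounding step further assumes that every decoding step has the same validity and that steps are independent, which is a simplification made only for illustration.

```python
import math

def top_k_tradeoff(probs, valid, k):
    """Validity and recall when sampling from the renormalized top-k tokens.

    validity: share of the kept probability mass that lies on valid tokens.
    recall:   fraction of the valid set that survives the cutoff.
    """
    kept = sorted(probs, key=probs.get, reverse=True)[:k]
    kept_mass = sum(probs[t] for t in kept)
    valid_mass = sum(probs[t] for t in kept if t in valid)
    recall = sum(1 for t in kept if t in valid) / len(valid)
    return valid_mass / kept_mass, recall

def oracle_entropy(probs, valid):
    """Entropy (nats) after an oracle cutoff that keeps exactly the valid tokens."""
    mass = sum(probs[t] for t in valid)
    return -sum((probs[t] / mass) * math.log(probs[t] / mass)
                for t in valid if probs[t] > 0)

def sequence_validity(step_validity, num_steps):
    """Sequence-level validity under the (assumed) independent, identical steps."""
    return step_validity ** num_steps

# Hypothetical toy distribution: "a".."d" are valid, "x" and "y" are invalid.
probs = {"a": 0.70, "x": 0.12, "b": 0.08, "y": 0.05, "c": 0.03, "d": 0.02}
valid = {"a", "b", "c", "d"}

# Order miscalibration: "x" outranks "b", so no rank cutoff gives both
# perfect validity and full recall of the valid set.
for k in (1, 3, 6):
    validity, recall = top_k_tradeoff(probs, valid, k)
    print(f"top-{k}: validity={validity:.3f} recall={recall:.2f}")

# Shape miscalibration: even with an oracle cutoff, the mass concentrated on "a"
# keeps entropy well below the uniform-over-valid maximum log(|valid|).
print(f"oracle entropy={oracle_entropy(probs, valid):.3f} "
      f"max={math.log(len(valid)):.3f}")

# Compounding: a small per-step validity loss shrinks quickly over a sequence.
print(f"sequence validity over 20 steps at 0.9 per step: "
      f"{sequence_validity(0.9, 20):.3f}")
```

In this toy case, widening the cutoff raises recall of the valid set but lowers validity, while the oracle cutoff removes the invalid tokens yet still leaves a distribution far from uniform over the valid ones. These correspond, loosely, to the order and shape bottlenecks described above.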