🤖 AI Summary
This work investigates entropy calibration in language model text generation, i.e., whether the entropy of the model's output distribution matches its log loss on human text. Prior work observed that entropy per step increases as generations grow longer, degrading output quality; truncating the distribution mitigates this but sacrifices diversity. We systematically address two questions: (1) Does entropy calibration improve significantly with scale? (2) Is it theoretically possible to calibrate without sacrificing log loss? Through theoretical modeling and empirical evaluation across models ranging from 0.5B to 70B parameters, we find that the scaling exponent of entropy miscalibration is close to zero, indicating that larger models do not calibrate appreciably better than smaller ones. Finally, we prove that lossless calibration is possible in principle, assuming access to a black box that can predict the future entropy of text, suggesting a path past the error-accumulation bottleneck in autoregressive generation.
📝 Abstract
We study the problem of entropy calibration, which asks whether a language model's entropy over generations matches its log loss on human text. Past work found that models are miscalibrated, with entropy per step increasing (and text quality decreasing) as generations grow longer. This error accumulation is a fundamental problem in autoregressive models, and the standard solution is to truncate the distribution, which improves text quality at the cost of diversity. In this paper, we ask: is miscalibration likely to improve with scale, and is it theoretically possible to calibrate without tradeoffs? To build intuition, we first study a simplified theoretical setting to characterize the scaling behavior of miscalibration with respect to dataset size. We find that the scaling behavior depends on the power law exponent of the data distribution -- in particular, for a power law exponent close to 1, the scaling exponent is close to 0, meaning that miscalibration improves very slowly with scale. Next, we measure miscalibration empirically in language models ranging from 0.5B to 70B parameters. We find that the observed scaling behavior is similar to what is predicted by the simplified setting: our fitted scaling exponents for text are close to 0, meaning that larger models accumulate error at a similar rate as smaller ones. This scaling (or lack thereof) provides one explanation for why we sample from larger models with similar amounts of truncation as from smaller models, even though the larger models are of higher quality. However, truncation is not a satisfying solution because it comes at the cost of increased log loss. In theory, is it even possible to reduce entropy while preserving log loss? We prove that it is possible, if we assume access to a black box which can fit models to predict the future entropy of text.
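To make the calibration condition concrete, the toy sketch below (not the paper's code; the distributions are made up for illustration) computes, for a single next-token step, the entropy of a model's predictive distribution and its expected log loss (cross-entropy) against a hypothetical human-text distribution. A calibrated model would have these two quantities equal; the gap between them is the per-step miscalibration that, per the abstract, accumulates over the length of a generation.

```python
import math

def entropy(q):
    """Shannon entropy H(q) = -sum_i q_i log q_i, in nats.

    This is the expected log loss of the model on its own samples.
    """
    return -sum(qi * math.log(qi) for qi in q if qi > 0)

def cross_entropy(p_data, q_model):
    """Expected log loss of model q on tokens drawn from the data distribution p."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p_data, q_model) if pi > 0)

# Hypothetical next-token distributions over a 4-token vocabulary
# (illustrative numbers only).
q_model = [0.70, 0.15, 0.10, 0.05]   # model's predictive distribution
p_data  = [0.80, 0.10, 0.05, 0.05]   # human-text distribution

h = entropy(q_model)                  # entropy of the model's own generations
ce = cross_entropy(p_data, q_model)   # log loss on human text

# Calibrated iff h == ce; the difference is the per-step miscalibration.
print(f"entropy        = {h:.4f} nats")
print(f"log loss       = {ce:.4f} nats")
print(f"miscalibration = {h - ce:+.4f} nats")
```

In this toy example the two quantities differ, so the model is miscalibrated at this step; summing such gaps over a generation gives the accumulated error the paper studies.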