🤖 AI Summary
This study investigates the reliability of perplexity as a model-selection metric, revealing fundamental limits on its ability to track predictive correctness. Through rigorous analysis grounded in the continuity theory of Transformers, the work establishes, for the first time, that any compact decoder-only model capable of accurate and confident prediction on some sequence necessarily admits other sequences with low perplexity yet incorrect predictions. The analysis further shows that perplexity reliably reflects a performance improvement only when an increase in model confidence is accompanied by a commensurate gain in accuracy. By characterizing iso-perplexity curves and providing formal proofs, this research elucidates the inherent inconsistency between perplexity and predictive accuracy, offering new theoretical foundations for model evaluation and metric design.
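For reference (the summary assumes it implicitly), the standard convention defines the perplexity of a sequence $x_{1:n}$ under a model $p_\theta$ as the exponentiated average negative log-likelihood; the paper's exact formulation may differ, but an iso-perplexity curve is then a level set of this quantity, i.e. the set of per-token probability profiles sharing the same geometric mean:

$$
\mathrm{PPL}(x_{1:n}) \;=\; \exp\!\Big(-\tfrac{1}{n}\sum_{i=1}^{n}\log p_\theta(x_i \mid x_{<i})\Big)
$$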
📝 Abstract
Perplexity -- a function measuring a model's overall level of "surprise" when encountering a particular output -- has gained significant traction in recent years, both as a loss function and as a simple-to-compute metric of model quality. Prior studies have pointed out several limitations of perplexity, often in an empirical manner. Here we leverage recent results on Transformer continuity to show in a rigorous manner how perplexity may be an unsuitable metric for model selection. Specifically, we prove that if there is any sequence that a compact decoder-only Transformer model predicts accurately and confidently -- a necessary prerequisite for strong generalisation -- then there must exist another sequence with very low perplexity that the same model nevertheless predicts incorrectly. Further, by analytically studying iso-perplexity plots, we find that perplexity will not always select the more accurate model -- rather, any increase in model confidence must be accompanied by a commensurate rise in accuracy for the new model to be selected.
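As a concrete illustration of the first result, here is a minimal sketch (not from the paper; the per-token probabilities are hypothetical) of how near-certain predictions at most positions can dominate the geometric mean, so a sequence keeps very low perplexity even though a greedy (argmax) decode is wrong at one position:

```python
import math

# Hypothetical probabilities a toy model assigns to the *actual* next
# token at each position of a 10-token sequence. Nine positions are
# near-certain; at the last one the true token gets probability 0.45
# while a competing token gets 0.50, so argmax decoding errs there.
p_true = [0.99] * 9 + [0.45]
p_top = [0.99] * 9 + [0.50]  # probability of the model's top choice

# Perplexity = exp of the mean negative log-likelihood of the sequence.
ppl = math.exp(-sum(math.log(p) for p in p_true) / len(p_true))
errors = sum(pt_top > pt for pt, pt_top in zip(p_true, p_top))

print(f"perplexity   = {ppl:.3f}")  # ~1.093: very low despite the error
print(f"greedy errors = {errors}")  # 1
```

Despite the incorrect prediction, the sequence's perplexity stays near the theoretical minimum of 1, which is the inconsistency between perplexity and accuracy that the abstract describes.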