Signal and Noise: A Framework for Reducing Uncertainty in Language Model Evaluation

📅 2025-08-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses decision uncertainty in large language model (LLM) evaluation arising from low signal-to-noise ratios (SNR) in benchmark datasets, proposing a "signal-noise" analytical framework in which SNR serves as the core measure of benchmark quality. Methodologically, the authors introduce three interventions: (1) adopting evaluation metrics that are more sensitive to model capability; (2) filtering out low-SNR subtasks to improve the aggregate benchmark; and (3) averaging a model's outputs across intermediate training checkpoints to suppress stochastic fluctuations. Drawing on 900,000 evaluations of 375 open-weight models on 30 benchmarks, they release a public evaluation dataset comprising 200 million instances. Experiments demonstrate that high-signal, low-noise benchmarks reduce scaling law prediction error by an average of 38% and markedly improve the stability and reliability of model ranking and selection in small-scale experiments.
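The summary's core quantity can be sketched in a few lines. This is a minimal illustration, not the paper's exact estimator: it assumes "signal" is the spread of final scores across different models and "noise" is the variability of one model's score over its last few training checkpoints; the function name and both definitions are assumptions for illustration.

```python
import statistics

def signal_to_noise(final_scores, checkpoint_scores):
    """Illustrative signal-to-noise ratio for a benchmark.

    final_scores: one final benchmark score per model (signal = their spread,
        i.e. how well the benchmark separates models).
    checkpoint_scores: one model's scores over its last few training
        checkpoints (noise = step-to-step variability).
    Both definitions are assumed for this sketch.
    """
    signal = statistics.pstdev(final_scores)
    noise = statistics.pstdev(checkpoint_scores)
    return signal / noise if noise > 0 else float("inf")
```

Under these assumed definitions, a benchmark whose scores vary widely across models but barely fluctuate between adjacent checkpoints gets a high ratio.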

📝 Abstract
Developing large language models is expensive and involves making decisions with small experiments, typically by evaluating on large, multi-task evaluation suites. In this work, we analyze specific properties which make a benchmark more reliable for such decisions, and interventions to design higher-quality evaluation benchmarks. We introduce two key metrics that show differences in current benchmarks: signal, a benchmark's ability to separate better models from worse models, and noise, a benchmark's sensitivity to random variability between training steps. We demonstrate that benchmarks with a better signal-to-noise ratio are more reliable when making decisions at small scale, and those with less noise have lower scaling law prediction error. These results suggest that improving signal or noise will lead to more useful benchmarks, so we introduce three interventions designed to directly affect signal or noise. For example, we propose that switching to a metric that has better signal and noise (e.g., perplexity rather than accuracy) leads to better reliability and improved scaling law error. We also find that filtering noisy subtasks, to improve an aggregate signal-to-noise ratio, leads to more reliable multi-task evaluations. We also find that averaging the output of a model's intermediate checkpoints to reduce noise leads to consistent improvements. We conclude by recommending that those creating new benchmarks, or selecting which existing benchmarks to use, aim for high signal and low noise. We use 30 benchmarks for these experiments, and 375 open-weight language models from 60M to 32B parameters, resulting in a new, publicly available dataset of 900K evaluation benchmark results, totaling 200M instances.
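The subtask-filtering intervention from the abstract (dropping noisy subtasks to raise the aggregate signal-to-noise ratio) can be sketched as below; the `keep_fraction` parameter, the ranking-by-SNR rule, and the function name are illustrative assumptions, not the paper's procedure.

```python
def filter_subtasks(subtask_snr, keep_fraction=0.5):
    """Keep only the highest-SNR subtasks of a multi-task benchmark.

    subtask_snr: mapping of subtask name -> estimated signal-to-noise ratio.
    keep_fraction: fraction of subtasks to retain (an assumed knob; the
        actual selection rule in the paper may differ).
    Returns the names of the retained subtasks, best first.
    """
    ranked = sorted(subtask_snr.items(), key=lambda kv: kv[1], reverse=True)
    n_keep = max(1, int(len(ranked) * keep_fraction))
    return [name for name, _ in ranked[:n_keep]]
```

For example, given four subtasks with SNRs {"a": 5.0, "b": 1.0, "c": 3.0, "d": 0.5} and `keep_fraction=0.5`, only "a" and "c" would be kept for the aggregate score.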
Problem

Research questions and friction points this paper is trying to address.

Analyze benchmark properties for reliable model evaluation
Introduce signal and noise metrics for benchmark quality
Propose interventions to improve signal-to-noise ratio
Innovation

Methods, ideas, or system contributions that make the work stand out.

Proposing signal and noise metrics for benchmarks
Switching to perplexity for better reliability
Averaging model checkpoints to reduce noise
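The checkpoint-averaging idea in the list above amounts to smoothing a noisy training curve. A minimal sketch, assuming the quantity being averaged is the per-checkpoint benchmark score and that the window size `k` is a free choice (both assumptions, not the paper's stated setup):

```python
def smoothed_score(checkpoint_scores, k=5):
    """Average the benchmark score over the last k checkpoints.

    checkpoint_scores: scores in training order, most recent last.
    k: window size (an illustrative default); averaging damps
       step-to-step stochastic fluctuations in the reported score.
    """
    if not checkpoint_scores:
        raise ValueError("need at least one checkpoint score")
    window = checkpoint_scores[-k:]
    return sum(window) / len(window)
```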