Position: Don't use the CLT in LLM evals with fewer than a few hundred datapoints

📅 2025-03-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
When evaluating large language models (LLMs) on small-sample (<500 instances), domain-specific benchmarks—such as the BIG-Bench Hard subset—statistical inference based on the Central Limit Theorem (CLT) severely underestimates uncertainty, shrinking confidence intervals by 3–10× and distorting significance testing. Method: We systematically diagnose the CLT’s failure in this regime and propose two small-data–appropriate alternatives: (1) well-calibrated Bayesian posterior inference (Beta-Binomial/Dirichlet-Multinomial models), and (2) exact frequentist methods (Clopper-Pearson intervals, Monte Carlo hypothesis tests). We also release a lightweight, open-source Bayesian evaluation library. Contribution/Results: Empirical validation demonstrates that our methods substantially improve the reliability and calibration of uncertainty quantification, enabling robust, statistically sound model comparisons on small-scale authoritative benchmarks—a new paradigm for rigorous LLM evaluation.
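The first alternative named above is Beta-Binomial posterior inference. As a rough, library-free sketch of that idea (assuming a uniform Beta(1, 1) prior and Monte Carlo quantile estimation; this is not code from the paper or its library):

```python
import random

def beta_posterior_interval(successes, n, level=0.95, a=1.0, b=1.0,
                            draws=100_000, seed=0):
    """Equal-tailed credible interval for a model's accuracy.

    Under a Beta(a, b) prior, observing `successes` correct answers out of
    `n` gives a Beta(a + k, b + n - k) posterior; its quantiles are
    estimated here by Monte Carlo sampling (stdlib only).
    """
    rng = random.Random(seed)
    post = sorted(rng.betavariate(a + successes, b + n - successes)
                  for _ in range(draws))
    tail = (1.0 - level) / 2.0
    return post[int(tail * draws)], post[int((1.0 - tail) * draws) - 1]

# 18/20 correct on a tiny benchmark: the interval stays honestly wide,
# reflecting genuine small-sample uncertainty
lo, hi = beta_posterior_interval(18, 20)
```

Because the posterior is a proper distribution over accuracy, the interval cannot collapse to zero width even when the model answers every question correctly.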

📝 Abstract
Rigorous statistical evaluations of large language models (LLMs), including valid error bars and significance testing, are essential for meaningful and reliable performance assessment. Currently, when such statistical measures are reported, they typically rely on the Central Limit Theorem (CLT). In this position paper, we argue that while CLT-based methods for uncertainty quantification are appropriate when benchmarks consist of thousands of examples, they fail to provide adequate uncertainty estimates for LLM evaluations that rely on smaller, highly specialized benchmarks. In these small-data settings, we demonstrate that CLT-based methods perform very poorly, usually dramatically underestimating uncertainty (i.e., producing error bars that are too small). We give recommendations for alternative frequentist and Bayesian methods that are both easy to implement and more appropriate in these increasingly common scenarios. We provide a simple Python library for these Bayesian methods at https://github.com/sambowyer/bayes_evals.
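The abstract's failure mode is easy to see in the standard CLT-based (Wald) interval. A minimal sketch, not taken from the paper's library:

```python
import math

def wald_interval(successes, n, z=1.96):
    """CLT-based 95% (Wald) interval: p_hat +/- z * sqrt(p_hat(1-p_hat)/n)."""
    p_hat = successes / n
    half = z * math.sqrt(p_hat * (1.0 - p_hat) / n)
    return max(0.0, p_hat - half), min(1.0, p_hat + half)

# A model scoring 20/20 on a tiny benchmark gets a zero-width interval:
# the CLT-based estimate claims certainty that the true accuracy is 1.0
lo, hi = wald_interval(20, 20)
```

By contrast, the exact Clopper-Pearson interval at 20/20 has a 95% lower bound of about 0.83, correctly conveying that twenty questions cannot pin down accuracy exactly.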
Problem

Research questions and friction points this paper is trying to address.

CLT-based error bars and significance tests dramatically underestimate uncertainty on small, specialized LLM benchmarks.
Despite this, reported evaluation statistics typically default to CLT-based methods regardless of benchmark size.
Practitioners lack easy-to-use small-sample alternatives, motivating both the recommended frequentist and Bayesian methods and the accompanying Python library.
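The summary also names Monte Carlo hypothesis tests among the exact frequentist alternatives. One common instance is a paired sign-flip permutation test for comparing two models on the same questions; the following is an illustrative sketch of that general technique, not the paper's exact procedure:

```python
import random

def paired_permutation_pvalue(scores_a, scores_b, draws=10_000, seed=0):
    """One-sided Monte Carlo p-value for 'model A outperforms model B'.

    Under the null that the two models are exchangeable, each paired
    per-question score difference is equally likely to carry either sign,
    so we randomly flip signs and count how often the permuted mean
    difference reaches the observed one.
    """
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = sum(diffs) / len(diffs)
    hits = 0
    for _ in range(draws):
        flipped = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if flipped / len(diffs) >= observed:
            hits += 1
    # Add-one correction keeps the Monte Carlo p-value valid (never zero)
    return (hits + 1) / (draws + 1)
```

Unlike a normal-approximation t-test, the null distribution here is simulated directly, so validity does not depend on the CLT holding at small n.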
Innovation

Methods, ideas, or system contributions that make the work stand out.

Diagnoses the failure of CLT-based uncertainty quantification on small LLM evaluation datasets
Recommends exact frequentist (Clopper-Pearson intervals, Monte Carlo tests) and Bayesian (Beta-Binomial/Dirichlet-Multinomial) alternatives
Releases a lightweight, open-source Python library (bayes_evals) implementing the Bayesian methods