🤖 AI Summary
Existing LLM evaluation relies on single-prompt measurements, yet LLMs exhibit high sensitivity to minor prompt perturbations, yielding unreliable and non-reproducible performance estimates.
Method: We propose the first formal definition of *reliable evaluation* and introduce a gradient-free, task-agnostic randomized evaluation framework. It operates over a space of meaning-preserving prompt perturbations, quantifies prompt sensitivity via first- and second-order moment estimation and confidence interval analysis, and adaptively determines the sampling size needed for stable estimates.
Contribution/Results: The framework is compatible with arbitrary LLMs, tasks, and evaluation metrics. Experiments across five state-of-the-art models—including GPT-4o and Claude-3.7-Sonnet—demonstrate that single-prompt evaluations frequently incur biases exceeding 15%, whereas our method substantially improves assessment stability and reproducibility without requiring model access or fine-tuning.
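The core loop described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `evaluate` and `sample_prompt` are hypothetical callables standing in for the task metric and the meaning-preserving paraphrase sampler, and the stopping rule (a normal-approximation confidence interval narrower than `epsilon`) is one plausible reading of the adaptive sampling-size criterion.

```python
import math
import statistics

def reliable_eval(evaluate, sample_prompt, epsilon=0.02, z=1.96,
                  min_samples=5, max_samples=200):
    """Stochastic method-of-moments evaluation sketch: resample prompt
    paraphrases until the CI on the mean score is narrower than epsilon."""
    scores = []
    while len(scores) < max_samples:
        prompt = sample_prompt()          # draw a paraphrase from the perturbation space
        scores.append(evaluate(prompt))   # task metric on this prompt (e.g., accuracy)
        if len(scores) >= min_samples:
            mean = statistics.mean(scores)                 # first moment
            std = statistics.stdev(scores)                 # second moment (sample std)
            half_width = z * std / math.sqrt(len(scores))  # CI half-width
            if half_width <= epsilon:                      # stop once estimate is tight
                break
    return mean, half_width, len(scores)
```

The returned mean replaces a single-prompt score with an estimate over the perturbation space, and the CI half-width makes the residual prompt sensitivity explicit.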
📝 Abstract
LLMs are highly sensitive to prompt phrasing, yet standard benchmarks typically report performance using a single prompt, raising concerns about the reliability of such evaluations. In this work, we argue for a stochastic, method-of-moments evaluation over the space of meaning-preserving prompt perturbations. We introduce a formal definition of reliable evaluation that accounts for prompt sensitivity, and suggest ReliableEval, a method for estimating the number of prompt resamplings needed to obtain meaningful results. Using our framework, we stochastically evaluate five frontier LLMs and find that even top-performing models like GPT-4o and Claude-3.7-Sonnet exhibit substantial prompt sensitivity. Our approach is model-, task-, and metric-agnostic, offering a recipe for meaningful and robust LLM evaluation.