🤖 AI Summary
Existing LLM evaluation relies on single-prompt measurements, yet LLMs exhibit high sensitivity to minor prompt perturbations, yielding unreliable and non-reproducible performance estimates.
Method: We propose the first formal definition of *reliable evaluation* and introduce a gradient-free, task-agnostic randomized evaluation framework. It operates over a space of meaning-preserving prompt perturbations, quantifies prompt sensitivity via first- and second-order moment estimation and confidence interval analysis, and adaptively determines the sampling size needed for stable estimates.
Contribution/Results: The framework is compatible with arbitrary LLMs, tasks, and evaluation metrics. Experiments across five state-of-the-art models—including GPT-4o and Claude-3.7-Sonnet—demonstrate that single-prompt evaluations frequently incur biases exceeding 15%, whereas our method substantially improves assessment stability and reproducibility without requiring model access or fine-tuning.
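The core loop described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `evaluate` and `sample_prompt` are hypothetical callables standing in for the task metric and the meaning-preserving paraphrase sampler, and the stopping rule (a normal-approximation confidence interval narrower than `epsilon`) is one plausible reading of the adaptive sampling-size criterion.

```python
import math
import statistics

def reliable_eval(evaluate, sample_prompt, epsilon=0.02, z=1.96,
                  min_samples=5, max_samples=200):
    """Stochastic method-of-moments evaluation sketch: resample prompt
    paraphrases until the CI on the mean score is narrower than epsilon."""
    scores = []
    while len(scores) < max_samples:
        prompt = sample_prompt()          # draw a paraphrase from the perturbation space
        scores.append(evaluate(prompt))   # task metric on this prompt (e.g., accuracy)
        if len(scores) >= min_samples:
            mean = statistics.mean(scores)                 # first moment
            std = statistics.stdev(scores)                 # second moment (sample std)
            half_width = z * std / math.sqrt(len(scores))  # CI half-width
            if half_width <= epsilon:                      # stop once estimate is tight
                break
    return mean, half_width, len(scores)
```

The returned mean replaces a single-prompt score with an estimate over the perturbation space, and the CI half-width makes the residual prompt sensitivity explicit.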
📝 Abstract
LLMs are highly sensitive to prompt phrasing, yet standard benchmarks typically report performance using a single prompt, raising concerns about the reliability of such evaluations. In this work, we argue for a stochastic, method-of-moments evaluation over the space of meaning-preserving prompt perturbations. We introduce a formal definition of reliable evaluation that accounts for prompt sensitivity, and suggest ReliableEval, a method for estimating the number of prompt resamplings needed to obtain meaningful results. Using our framework, we stochastically evaluate five frontier LLMs and find that even top-performing models like GPT-4o and Claude-3.7-Sonnet exhibit substantial prompt sensitivity. Our approach is model-, task-, and metric-agnostic, offering a recipe for meaningful and robust LLM evaluation.