🤖 AI Summary
Current LLM evaluation relies heavily on single-prompt assessments, overlooking models’ joint sensitivity to multidimensional prompt perturbations—such as delimiters, enumeration formats, and instruction phrasing—leading to inflated performance estimates. Method: We introduce DOVE, the first large-scale benchmark for evaluating prompt robustness under *joint* multidimensional perturbations, built upon MMLU, GSM8K, and other standard benchmarks. It comprises over 250 million perturbed prompts and corresponding model responses, with systematically designed, orthogonal perturbation dimensions. Using batched multi-model inference and statistical attribution analysis, we quantify sensitivity across perturbation types. Contribution/Results: We find that few-shot examples substantially mitigate prompt sensitivity; identify intrinsically difficult instances that fail consistently across perturbations; expose systematic bias in single-prompt evaluation; propose an efficient prompt selection strategy; and publicly release all data, code, and tools to advance reproducible, robust LLM evaluation.
📝 Abstract
Recent work has found that LLMs are sensitive to a wide range of arbitrary prompt dimensions, including the type of delimiters, answer enumerators, instruction wording, and more. This calls into question popular single-prompt evaluation practices. We present DOVE (Dataset Of Variation Evaluation), a large-scale dataset containing prompt perturbations of various evaluation benchmarks. In contrast to previous work, we examine LLM sensitivity from a holistic perspective, assessing the joint effects of perturbations along multiple dimensions, which results in thousands of perturbations per instance. We evaluate several model families against DOVE, leading to several findings: efficient methods for choosing well-performing prompts, the observation that few-shot examples reduce sensitivity, and the identification of instances that are inherently hard across all perturbations. DOVE consists of more than 250M prompt perturbations and model outputs, which we make publicly available to spur a community-wide effort toward meaningful, robust, and efficient evaluation. Browse the data, contribute, and more: https://slab-nlp.github.io/DOVE/
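To make the idea of *joint* multidimensional perturbation concrete, the sketch below enumerates the Cartesian product of a few prompt dimensions (instruction phrasing, delimiter, answer enumerator) for a single multiple-choice instance. The dimension names and example values here are illustrative assumptions, not DOVE's actual perturbation axes or code; they only show how combining dimensions multiplies the number of prompt variants per instance.

```python
from itertools import product

# Hypothetical perturbation dimensions (illustrative values, not DOVE's actual axes)
instructions = [
    "Answer the following question.",
    "Choose the correct option.",
]
delimiters = [": ", " - ", ") "]
enumerators = [["A", "B", "C", "D"], ["1", "2", "3", "4"], ["I", "II", "III", "IV"]]


def render_prompt(question, choices, instruction, delimiter, enum_style):
    """Render one jointly perturbed prompt for a multiple-choice question."""
    lines = [instruction, question]
    for label, choice in zip(enum_style, choices):
        lines.append(f"{label}{delimiter}{choice}")
    return "\n".join(lines)


question = "What is the capital of France?"
choices = ["Paris", "London", "Berlin", "Rome"]

# The Cartesian product of the dimensions yields one joint perturbation
# per combination: 2 instructions x 3 delimiters x 3 enumerators = 18 prompts.
prompts = [
    render_prompt(question, choices, instr, delim, enum)
    for instr, delim, enum in product(instructions, delimiters, enumerators)
]
print(len(prompts))  # → 18
```

With realistic dimension inventories (and few-shot example choices as further dimensions), this product quickly reaches the thousands of perturbations per instance described in the abstract.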