Structured Prompting Enables More Robust, Holistic Evaluation of Language Models

📅 2025-11-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing LM evaluation frameworks such as HELM rely on fixed prompts that fail to generalize across models, yielding unrepresentative performance estimates and inconsistent cross-model rankings. To address this, the authors propose DSPy+HELM, an integrated framework that incorporates structured prompting methods (including chain-of-thought) into standardized evaluation, using declarative prompt optimization to elicit model reasoning. In reproducible, large-scale experiments, four frontier LMs are evaluated with four prompting methods across seven benchmarks spanning general and medical domains. Compared with HELM's fixed-prompt baselines, structured prompting raises average accuracy by 4%, reduces cross-benchmark variability (2% lower standard deviation), and flips leaderboard rankings on 3 of 7 benchmarks, yielding estimates closer to each model's true performance ceiling. The prompt optimization pipeline and HELM integration are open-sourced to improve the decision utility and reproducibility of LM benchmarks.

📝 Abstract
As language models (LMs) are increasingly adopted across domains, high-quality benchmarking frameworks that accurately estimate performance are essential for guiding deployment decisions. While frameworks such as Holistic Evaluation of Language Models (HELM) enable broad evaluation across tasks, they often rely on fixed prompts that fail to generalize across LMs, yielding unrepresentative performance estimates. Unless we estimate each LM's ceiling (maximum achievable via changes to the prompt), we risk underestimating performance. Declarative prompting frameworks, such as DSPy, offer a scalable alternative to manual prompt engineering by crafting structured prompts that can be optimized per task. However, such frameworks have not been systematically evaluated across established benchmarks. We present a reproducible DSPy+HELM framework that introduces structured prompting methods which elicit reasoning, enabling more accurate LM benchmarking. Using four prompting methods, we evaluate four frontier LMs across seven benchmarks (general/medical domain) against existing HELM baseline scores. We find that without structured prompting: (i) HELM underestimates LM performance (by 4% average), (ii) performance estimates vary more across benchmarks (+2% standard deviation), (iii) performance gaps are misrepresented (leaderboard rankings flip on 3/7 benchmarks), and (iv) introducing reasoning (chain-of-thought) reduces LM sensitivity to prompt design (smaller Δ across prompts). To our knowledge, this is the first large-scale benchmarking study to empirically characterize LM behavior across benchmarks and prompting methods, showing that scalable performance ceiling estimation enables more decision-useful benchmarks. We open-source (i) DSPy+HELM Integration (https://github.com/stanford-crfm/helm/pull/3893) and (ii) Prompt Optimization Pipeline (https://github.com/StanfordMIMI/dspy-helm).
Problem

Research questions and friction points this paper addresses.

Fixed prompts in benchmarks underestimate language model performance
Structured prompting enables more accurate performance ceiling estimation
Scalable prompting methods reduce sensitivity to prompt design variations
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses structured prompting methods to elicit reasoning
Integrates DSPy with HELM for reproducible LM benchmarking
Optimizes prompts per task to estimate performance ceilings
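The per-task prompt optimization idea above can be sketched in plain Python. This is an illustrative stand-in, not the DSPy API: the template strings, `build_prompt`, and `optimize_strategy` are hypothetical names, and real DSPy optimizers search a much richer space than two fixed templates.

```python
# Minimal sketch: estimate a model's performance ceiling by choosing the
# best prompting strategy per task, instead of a single fixed prompt.
# (Illustrative only; not the DSPy or HELM API.)

FIXED_TEMPLATE = "Question: {q}\nAnswer:"

COT_TEMPLATE = (
    "Question: {q}\n"
    "Let's reason step by step before answering.\n"
    "Reasoning:"
)

def build_prompt(question: str, strategy: str = "fixed") -> str:
    """Render a prompt for one benchmark item under a given strategy."""
    templates = {"fixed": FIXED_TEMPLATE, "cot": COT_TEMPLATE}
    return templates[strategy].format(q=question)

def optimize_strategy(dev_questions, score_fn, strategies=("fixed", "cot")):
    """Pick the strategy with the best average score on a dev set.

    `score_fn(prompt) -> float` is a stand-in for running the LM and
    grading its answer; the winner approximates the task's ceiling.
    """
    def avg_score(strategy):
        total = sum(score_fn(build_prompt(q, strategy)) for q in dev_questions)
        return total / len(dev_questions)
    return max(strategies, key=avg_score)
```

Benchmarking the winner of this per-task search, rather than one hand-written prompt, is what lets the framework report ceiling estimates instead of prompt-sensitive point estimates.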