🤖 AI Summary
This work addresses the difficulty current AI research agents have in judging the methodological soundness of scientific proposals at an early stage, where they often overrate low-quality submissions. The authors introduce SoundnessBench, a benchmark of 1,099 machine-learning research proposals reconstructed from ICLR submissions, labeled with reviewer soundness sub-scores and audited against the source papers for label reliability. The benchmark targets soundness that is recoverable at the proposal stage rather than exact prediction of full-paper review outcomes. Evaluating 12 frontier LLMs, with human audits, aggressive prompting, and control experiments (public-corpus contamination checks, controls for paper-identifying phrases, and surface-feature controls), the study reveals a pervasive optimism bias. Stricter prompting mostly shifts errors from false positives to false negatives rather than reducing them, so these models are not yet reliable as standalone first-gate screeners of research proposals.
📝 Abstract
Autonomous AI research agents aim to accelerate scientific discovery by automating the research pipeline, from hypothesis generation to peer review. However, existing benchmarks rarely test a fundamental bottleneck: whether Large Language Models can judge the methodological viability of a research idea before expending time and computational resources. We introduce SoundnessBench, a curated benchmark of 1,099 machine-learning research proposals reconstructed from ICLR submissions, labeled with reviewer soundness sub-scores, and audited against source papers. SoundnessBench should be interpreted as a benchmark for recoverable proposal-stage soundness rather than exact prediction of full-paper review outcomes. Across 12 frontier LLMs, we find a pervasive optimism bias: under standard prompting, models frequently rate low-soundness proposals as sound, while aggressive prompting largely shifts errors from false positives to false negatives. Additional controls for public-corpus contamination, paper-identifying phrases, surface features, and human audit quality suggest that this behavior is not explained by a single confounder. Our results indicate that current LLMs are not yet reliable as standalone first-gate evaluators for scientific rigor.
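The trade-off described above, where stricter prompting moves errors from false positives to false negatives, can be illustrated with a minimal sketch. It assumes binary labels (True meaning "sound") and uses made-up toy data; the names `ErrorProfile` and `error_profile` are hypothetical, and nothing here reproduces the paper's actual scoring protocol or results.

```python
from dataclasses import dataclass


@dataclass
class ErrorProfile:
    false_positive_rate: float  # share of low-soundness proposals judged sound
    false_negative_rate: float  # share of sound proposals judged unsound


def error_profile(labels, predictions):
    """Compute error rates; both inputs are sequences of bools (True = sound)."""
    if len(labels) != len(predictions):
        raise ValueError("labels and predictions must have equal length")
    pairs = list(zip(labels, predictions))
    fp = sum(1 for y, p in pairs if not y and p)
    fn = sum(1 for y, p in pairs if y and not p)
    negatives = sum(1 for y in labels if not y)
    positives = len(labels) - negatives
    return ErrorProfile(
        false_positive_rate=fp / negatives if negatives else 0.0,
        false_negative_rate=fn / positives if positives else 0.0,
    )


# Toy data (assumption, not from the benchmark): three sound, three unsound.
labels = [True, True, False, False, False, True]
lenient = [True, True, True, True, False, True]      # optimistic judge
strict = [False, True, False, False, False, False]   # harsh judge

lenient_profile = error_profile(labels, lenient)
strict_profile = error_profile(labels, strict)
print(lenient_profile)
print(strict_profile)
```

In this toy case the lenient judge makes only false-positive errors and the strict judge makes only false-negative errors, which is the kind of shift the abstract reports; a reliable first-gate evaluator would need both rates to be low at once.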