🤖 AI Summary
This work addresses the challenge of evaluating large language models’ (LLMs) ability to design ablation experiments in scientific research—a task where existing automated evaluation methods lack reliability for complex scholarly reasoning. To this end, we introduce AbGen, the first dedicated benchmark comprising 1,500 expert-annotated ablation designs drawn from 807 NLP papers, alongside the AbGen-Eval meta-evaluation benchmark for systematic assessment. Our methodology integrates human annotation, LLM-as-Judge scoring, and meta-evaluation to enable rigorous human–machine comparative analysis. Results reveal that state-of-the-art LLMs significantly underperform human experts across three critical dimensions of ablation design: importance, faithfulness, and soundness; furthermore, automated evaluation scores diverge systematically from human judgments. This study exposes fundamental limitations in current LLMs’ higher-order scientific reasoning capabilities and provides a meta-evaluation testbed for building more reliable automated evaluation of complex scientific tasks.
📝 Abstract
We introduce AbGen, the first benchmark designed to evaluate the capabilities of LLMs in designing ablation studies for scientific research. AbGen consists of 1,500 expert-annotated examples derived from 807 NLP papers. In this benchmark, LLMs are tasked with generating detailed ablation study designs for a specified module or process based on the given research context. Our evaluation of leading LLMs, such as DeepSeek-R1-0528 and o4-mini, highlights a significant performance gap between these models and human experts in terms of the importance, faithfulness, and soundness of the ablation study designs. Moreover, we demonstrate that current automated evaluation methods are not reliable for our task, as they show a significant discrepancy when compared to human assessment. To better investigate this, we develop AbGen-Eval, a meta-evaluation benchmark designed to assess the reliability of commonly used automated evaluation systems in measuring LLM performance on our task. We investigate various LLM-as-Judge systems on AbGen-Eval, providing insights for future research on developing more effective and reliable LLM-based evaluation systems for complex scientific tasks.
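The meta-evaluation in AbGen-Eval ultimately reduces to asking how closely an automated judge's scores track expert judgments on the same designs. As a rough illustration only (the paper's exact scoring scale, data format, and correlation measures are assumptions here, not taken from the source), a minimal sketch might correlate judge and human scores for one dimension:

```python
from scipy.stats import pearsonr, spearmanr

# Hypothetical records: each ablation design scored by a human expert and by an
# LLM-as-Judge system on the same 1-5 scale for one dimension (e.g., faithfulness).
# Field names and the score scale are illustrative, not drawn from AbGen-Eval itself.
records = [
    {"design_id": "d001", "human": 4.0, "judge": 4.5},
    {"design_id": "d002", "human": 2.5, "judge": 4.0},
    {"design_id": "d003", "human": 5.0, "judge": 4.5},
    {"design_id": "d004", "human": 3.0, "judge": 3.5},
]

human_scores = [r["human"] for r in records]
judge_scores = [r["judge"] for r in records]

# Meta-evaluation question: how well do automated judge scores track human judgments?
pearson_r, _ = pearsonr(human_scores, judge_scores)
spearman_rho, _ = spearmanr(human_scores, judge_scores)

print(f"Pearson r:    {pearson_r:.3f}")
print(f"Spearman rho: {spearman_rho:.3f}")
```

Under this framing, a low correlation on any dimension would indicate that the automated judge cannot be trusted as a stand-in for human assessment on that dimension, which is the kind of discrepancy the abstract reports for current evaluation systems.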