🤖 AI Summary
This study addresses the challenge of evaluating long-form scientific question answering across 75 research fields while reducing reliance on scarce expert annotation. To this end, we introduce ResearchQA, a large-scale academic QA benchmark comprising 21,000 real research questions and 160,000 fine-grained rubric items, extracted jointly from survey literature. Using these rubrics, we construct an automatic pairwise judge that reaches 74% agreement with expert judgments. Validation by 31 Ph.D.-level annotators confirms that 96% of questions reflect genuine Ph.D. information needs and 87% of rubric items warrant substantive responses. Evaluation of 18 state-of-the-art systems over more than 7.6K pairwise comparisons shows that no parametric or retrieval-augmented system exceeds 70% rubric coverage and the best agentic system reaches 75%, with pronounced gaps in citing papers, describing limitations, and making comparisons.
📝 Abstract
Evaluating long-form responses to research queries relies heavily on expert annotators, restricting attention to areas like AI where researchers can conveniently enlist colleagues. Yet research expertise is widespread: survey articles synthesize knowledge distributed across the literature. We introduce ResearchQA, a resource for evaluating LLM systems built by distilling survey articles from 75 research fields into 21K queries and 160K rubric items. Each rubric, derived jointly with its query from survey sections, lists query-specific answer evaluation criteria, e.g., citing papers, providing explanations, and describing limitations. Assessments by 31 Ph.D. annotators in 8 fields indicate that 96% of queries support Ph.D. information needs and that 87% of rubric items should be addressed in system responses by a sentence or more. Using our rubrics, we construct an automatic pairwise judge that obtains 74% agreement with expert judgments. We leverage ResearchQA to analyze competency gaps in 18 systems across over 7.6K pairwise evaluations. No parametric or retrieval-augmented system we evaluate exceeds 70% rubric item coverage, and the highest-ranking agentic system reaches 75%. Error analysis reveals that the highest-ranking system fully addresses fewer than 11% of citation rubric items, 48% of limitation items, and 49% of comparison items. We release our data to facilitate more comprehensive multi-field evaluations.
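As a rough illustration of the rubric-based scoring described above, the sketch below shows one way per-query rubric coverage and a coverage-based pairwise preference could be computed. The `RubricItem` type, the `judge_covers` callable (standing in for an LLM judge deciding whether a response addresses a criterion), and the tie-breaking rule are illustrative assumptions, not the paper's released judge.

```python
# Hypothetical sketch of rubric-coverage scoring; not the authors' implementation.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class RubricItem:
    text: str  # e.g., "Cites relevant papers" or "Describes key limitations"

def coverage(response: str, rubric: List[RubricItem],
             judge_covers: Callable[[str, str], bool]) -> float:
    """Fraction of rubric items the judge marks as addressed in the response."""
    if not rubric:
        return 0.0
    covered = sum(judge_covers(response, item.text) for item in rubric)
    return covered / len(rubric)

def pairwise_winner(resp_a: str, resp_b: str, rubric: List[RubricItem],
                    judge_covers: Callable[[str, str], bool]) -> str:
    """Prefer the response with higher rubric coverage; 'tie' on equal coverage."""
    a = coverage(resp_a, rubric, judge_covers)
    b = coverage(resp_b, rubric, judge_covers)
    return "A" if a > b else ("B" if b > a else "tie")
```

In practice, `judge_covers` would be an LLM call prompted with the query, the rubric item, and the response; the coverage fraction corresponds to the "covering rubric items" percentages reported in the abstract.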