🤖 AI Summary
This study addresses the challenge of evaluating long-form scientific question answering across 75 research fields while reducing reliance on scarce expert annotation. To this end, we introduce ResearchQA, a large-scale academic QA benchmark comprising 21,000 real research questions and 160,000 fine-grained rubric items, extracted jointly from survey literature. Using these rubrics, we construct an automatic pairwise judge that reaches 74% agreement with expert judgments. Validation by 31 Ph.D.-level annotators confirms that 96% of questions reflect genuine Ph.D. information needs and 87% of rubric items warrant substantive responses. Evaluation of 18 state-of-the-art systems over more than 7.6K pairwise comparisons shows that no parametric or retrieval-augmented system exceeds 70% rubric coverage and the best agentic system reaches 75%, with pronounced gaps in citing papers, describing limitations, and making comparisons.
📝 Abstract
Evaluating long-form responses to research queries relies heavily on expert annotators, restricting attention to areas like AI where researchers can conveniently enlist colleagues. Yet research expertise is widespread: survey articles synthesize knowledge distributed across the literature. We introduce ResearchQA, a resource for evaluating LLM systems built by distilling survey articles from 75 research fields into 21K queries and 160K rubric items. Each rubric, derived jointly with its query from survey sections, lists query-specific answer evaluation criteria, e.g., citing papers, providing explanations, and describing limitations. Assessments by 31 Ph.D. annotators in 8 fields indicate that 96% of queries support Ph.D. information needs and that 87% of rubric items should be addressed in system responses by a sentence or more. Using our rubrics, we construct an automatic pairwise judge that obtains 74% agreement with expert judgments. We leverage ResearchQA to analyze competency gaps in 18 systems across over 7.6K pairwise evaluations. No parametric or retrieval-augmented system we evaluate exceeds 70% rubric item coverage, and the highest-ranking agentic system reaches 75%. Error analysis reveals that the highest-ranking system fully addresses fewer than 11% of citation rubric items, 48% of limitation items, and 49% of comparison items. We release our data to facilitate more comprehensive multi-field evaluations.
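As a rough illustration of the rubric-based scoring described above, the sketch below shows one way per-query rubric coverage and a coverage-based pairwise preference could be computed. The `RubricItem` type, the `judge_covers` callable (standing in for an LLM judge deciding whether a response addresses a criterion), and the tie-breaking rule are illustrative assumptions, not the paper's released judge.

```python
# Hypothetical sketch of rubric-coverage scoring; not the authors' implementation.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class RubricItem:
    text: str  # e.g., "Cites relevant papers" or "Describes key limitations"

def coverage(response: str, rubric: List[RubricItem],
             judge_covers: Callable[[str, str], bool]) -> float:
    """Fraction of rubric items the judge marks as addressed in the response."""
    if not rubric:
        return 0.0
    covered = sum(judge_covers(response, item.text) for item in rubric)
    return covered / len(rubric)

def pairwise_winner(resp_a: str, resp_b: str, rubric: List[RubricItem],
                    judge_covers: Callable[[str, str], bool]) -> str:
    """Prefer the response with higher rubric coverage; 'tie' on equal coverage."""
    a = coverage(resp_a, rubric, judge_covers)
    b = coverage(resp_b, rubric, judge_covers)
    return "A" if a > b else ("B" if b > a else "tie")
```

In practice, `judge_covers` would be an LLM call prompted with the query, the rubric item, and the response; the coverage fraction corresponds to the "covering rubric items" percentages reported in the abstract.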