🤖 AI Summary
This work addresses the limitations of current foundation models in comprehending scientific papers, particularly their struggles with domain-specific terminology and complex figures, as well as the lack of fine-grained, large-scale evaluation benchmarks. To bridge this gap, the authors introduce the first fine-grained question-answering benchmark for scientific paper understanding, comprising 15K human-verified QA pairs constructed from peer-review and rebuttal dialogues on computer science papers. Questions are systematically categorized into three types ("why," "what," and "how") aligned with the scientific research workflow. The study further proposes a collaborative annotation framework involving both large language models and human experts, along with a multidimensional automatic evaluation paradigm that jointly assesses correctness, completeness, and conciseness. Experiments reveal that even the strongest model (GPT-5) achieves only 68.2% on correctness-completeness, dropping to 37.46% after conciseness adjustment, which highlights significant deficiencies in precise academic comprehension.
📝 Abstract
Understanding research papers remains challenging for foundation models due to specialized scientific discourse and complex figures and tables, yet existing benchmarks offer limited fine-grained evaluation at scale. To address this gap, we introduce RPC-Bench, a large-scale question-answering benchmark built from review-rebuttal exchanges of high-quality computer science papers, containing 15K human-verified QA pairs. We design a fine-grained taxonomy aligned with the scientific research workflow to assess models' ability to understand and answer why, what, and how questions in scholarly contexts. We also develop an LLM-human collaborative annotation framework to support large-scale labeling and quality control. Following the LLM-as-a-Judge paradigm, we build a scalable evaluation framework that scores models on correctness-completeness and conciseness, with high agreement with human judgment. Experiments reveal that even the strongest model (GPT-5) achieves only 68.2% correctness-completeness, dropping to 37.46% after conciseness adjustment, highlighting substantial gaps in precise academic paper understanding. Our code and data are available at https://rpc-bench.github.io/.
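To make the evaluation setup concrete, here is a minimal Python sketch of an LLM-as-a-Judge scorer in this spirit. The `judge` stub, its rubric, and the multiplicative conciseness adjustment are assumptions for illustration only; the abstract does not specify how the judge prompt is built or how the adjusted score is computed.

```python
# Minimal sketch of an LLM-as-a-Judge scorer with a conciseness adjustment,
# in the spirit of RPC-Bench's multidimensional evaluation. The judge call,
# the rubric, and the way conciseness combines with correctness-completeness
# are assumptions for illustration, not the authors' implementation.
from dataclasses import dataclass


@dataclass
class JudgeScores:
    correctness_completeness: float  # in [0, 1]: is the answer right and full?
    conciseness: float               # in [0, 1]: is it free of padding?


def judge(question: str, reference: str, answer: str) -> JudgeScores:
    """Placeholder for a judge-LLM call that rates the candidate answer
    against the human-verified reference on each dimension.

    A real implementation would send a rubric-based prompt to a judge model
    and parse its per-dimension ratings; here we return dummy values chosen
    to echo the numbers reported in the abstract.
    """
    return JudgeScores(correctness_completeness=0.682, conciseness=0.55)


def adjusted_score(s: JudgeScores) -> float:
    """One plausible conciseness adjustment (an assumption): scale the
    correctness-completeness score by the conciseness rating, penalizing
    answers that are correct but verbose."""
    return s.correctness_completeness * s.conciseness


if __name__ == "__main__":
    scores = judge(
        question="Why does the ablation remove the auxiliary loss?",
        reference="The auxiliary loss destabilizes training on small splits.",
        answer="Because it destabilizes training on small splits, and also ...",
    )
    print(f"correctness-completeness: {scores.correctness_completeness:.2%}")
    print(f"after conciseness adjustment: {adjusted_score(scores):.2%}")
```

Under this (assumed) multiplicative scheme, a model answering correctly but verbosely keeps its correctness-completeness score while its adjusted score drops sharply, which is consistent with the kind of gap the abstract reports between 68.2% and 37.46%.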