🤖 AI Summary
This work addresses three core challenges in scientific question answering: evidence retrieval, unanswerable question detection, and long-answer generation. To this end, we introduce PeerQA, a document-level scientific QA dataset derived from authentic peer reviews, comprising 579 author-annotated QA pairs from 208 papers spanning machine learning, natural language processing, geoscience, and public health. Reviewer questions are recast as QA tasks, and experiments show that even simple *decontextualization* consistently improves document-level retrieval across architectures. PeerQA also serves as a challenging benchmark for long-context scientific QA, with papers averaging 12k tokens. Baseline experiments on all three tasks indicate that current models still struggle with lengthy scientific texts. All data and code are publicly released to foster reproducible scientific QA research.
📝 Abstract
We present PeerQA, a real-world, scientific, document-level Question Answering (QA) dataset. PeerQA questions have been sourced from peer reviews, which contain questions that reviewers raised while thoroughly examining the scientific article. Answers have been annotated by the original authors of each paper. The dataset contains 579 QA pairs from 208 academic articles, with a majority from ML and NLP, as well as a subset from other scientific communities such as Geoscience and Public Health. PeerQA supports three critical tasks for developing practical QA systems: evidence retrieval, unanswerable question classification, and answer generation. We provide a detailed analysis of the collected dataset and conduct experiments establishing baseline systems for all three tasks. Our experiments and analyses reveal the need for decontextualization in document-level retrieval, where we find that even simple decontextualization approaches consistently improve retrieval performance across architectures. On answer generation, PeerQA serves as a challenging benchmark for long-context modeling, as the papers have an average size of 12k tokens. Our code and data are available at https://github.com/UKPLab/peerqa.