🤖 AI Summary
Scientific papers often report research limitations vaguely or imprecisely, undermining reproducibility and scholarly trust. To address this, we introduce the first end-to-end benchmark specifically designed for research limitations, encompassing automatic extraction, generation, and dual-layer evaluation (fine-grained plus meta-evaluation). Our contributions include: (1) a limitations-oriented Retrieval-Augmented Generation (RAG) framework; (2) a high-quality, manually annotated dataset that integrates papers from major venues (ACL, NeurIPS, PeerJ) with external peer reviews; and (3) a suite of multidimensional automated evaluation metrics together with a rigorous meta-evaluation protocol. Experiments show that our approach significantly improves the relevance and verifiability of generated limitations, while the evaluation framework exhibits strong discriminative power and robustness across diverse models and settings. All data, annotations, and code are publicly released to advance AI-assisted research integrity.
📝 Abstract
In scientific research, limitations are the shortcomings, constraints, or weaknesses of a study. Transparent reporting of such limitations can enhance the quality and reproducibility of research and improve public trust in science. However, authors often (a) underreport limitations in the paper text and (b) use hedging strategies to satisfy editorial requirements at the cost of readers' clarity and confidence. This underreporting, combined with the explosion in the number of publications, creates a pressing need to automatically extract or generate such limitations from scholarly papers. To this end, we present a complete architecture for the computational analysis of research limitations. Specifically, we create a dataset of limitations in ACL, NeurIPS, and PeerJ papers by extracting them from the papers' text and integrating them with external reviews; we propose methods to generate them automatically using a novel Retrieval-Augmented Generation (RAG) technique; we build a fine-grained evaluation framework for generated limitations; and we provide a meta-evaluation of the proposed evaluation techniques.
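The abstract does not spell out how the RAG step works. As a rough illustration only, not the paper's actual method, the sketch below retrieves the paper passages most relevant to a limitations-style query using a simple bag-of-words cosine score and assembles them into a generation prompt for a downstream language model. All names here (`score`, `retrieve`, `build_prompt`) and the scoring choice are hypothetical assumptions.

```python
# Illustrative sketch of a limitations-oriented RAG retrieval step.
# NOTE: this is a toy stand-in, not the system described in the paper.
from collections import Counter
import math

def score(query_tokens, passage_tokens):
    """Cosine similarity over bag-of-words token counts."""
    q, p = Counter(query_tokens), Counter(passage_tokens)
    num = sum(q[t] * p[t] for t in set(q) & set(p))
    den = (math.sqrt(sum(v * v for v in q.values()))
           * math.sqrt(sum(v * v for v in p.values())))
    return num / den if den else 0.0

def retrieve(query, passages, k=2):
    """Return the k passages most similar to the query."""
    qt = query.lower().split()
    ranked = sorted(passages,
                    key=lambda p: score(qt, p.lower().split()),
                    reverse=True)
    return ranked[:k]

def build_prompt(query, passages, k=2):
    """Assemble retrieved context plus the task into one prompt string."""
    context = "\n".join(f"- {p}" for p in retrieve(query, passages, k))
    return f"Context:\n{context}\n\nTask: {query}"
```

In a real system the bag-of-words scorer would typically be replaced by dense embeddings and the prompt sent to an LLM; the structure of retrieve-then-generate, however, is the same.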