PAPERMIND: Benchmarking Agentic Reasoning and Critique over Scientific Papers in Multimodal LLMs

📅 2026-04-23

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing benchmarks inadequately assess large models’ holistic understanding of scientific papers, as they typically emphasize isolated skills and overlook the synergistic cognitive processes inherent in authentic scientific reading. To address this gap, this work introduces a unified evaluation framework that conceptualizes scientific paper comprehension as an integrative, agent-oriented cognitive process. The framework spans seven scientific disciplines and incorporates both textual and visual modalities, structured around four complementary task families: multimodal alignment, experimental interpretation, cross-paper evidence-based reasoning, and critical evaluation. Built upon real scientific publications, the accompanying multimodal dataset enables systematic assessment of state-of-the-art multimodal large language models. Empirical evaluations reveal consistent and significant performance bottlenecks in integrative scientific reasoning and critical appraisal, underscoring the benchmark’s diagnostic utility and its capacity to pose meaningful challenges for future model development.

📝 Abstract

Understanding scientific papers requires more than answering isolated questions or summarizing content. It involves an integrated reasoning process that grounds textual and visual information, interprets experimental evidence, synthesizes information across sources, and critically evaluates scientific claims. However, existing benchmarks typically assess these abilities in isolation, making it difficult to evaluate scientific paper understanding as a unified set of interacting cognitive abilities. In this work, we introduce PAPERMIND, a benchmark designed to evaluate integrated and agent-oriented scientific reasoning over research papers. PAPERMIND is constructed from real scientific papers across seven domains: agriculture, biology, chemistry, computer science, medicine, physics, and economics. It comprises four complementary task families that collectively operationalize distinct cognitive facets of scientific paper reasoning: multimodal grounding, experimental interpretation, cross-source evidence reasoning, and critical assessment. By analyzing model behavior across multiple tasks, PAPERMIND enables a diagnostic evaluation of integrated scientific reasoning behaviors that are difficult to assess through isolated task evaluations. Extensive experiments on both open-source and closed-source multimodal LLMs reveal consistent performance gaps across tasks, highlighting persistent challenges in integrated scientific reasoning and critique. Our benchmark and dataset are available at https://github.com/Yanjun-Zhao/PaperMind.
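
To make the idea of a per-task diagnostic breakdown concrete, here is a minimal sketch of how results might be grouped by task family. It is an illustration only, not the benchmark's actual evaluation code: the record layout `(task_family, domain, is_correct)`, the binary correctness score, and the function name `per_family_accuracy` are all assumptions. Only the task-family and domain names come from the abstract.

```python
from collections import defaultdict

# Task families and domains as named in the abstract.
TASK_FAMILIES = (
    "multimodal grounding",
    "experimental interpretation",
    "cross-source evidence reasoning",
    "critical assessment",
)
DOMAINS = (
    "agriculture", "biology", "chemistry", "computer science",
    "medicine", "physics", "economics",
)


def per_family_accuracy(records):
    """Return accuracy per task family.

    `records` is an iterable of (task_family, domain, is_correct) tuples.
    This layout and the binary scoring are hypothetical; the real benchmark
    may use different fields and metrics.
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for family, domain, is_correct in records:
        if family not in TASK_FAMILIES:
            raise ValueError(f"unknown task family: {family!r}")
        if domain not in DOMAINS:
            raise ValueError(f"unknown domain: {domain!r}")
        totals[family] += 1
        correct[family] += int(bool(is_correct))
    # Report only families that have at least one record.
    return {f: correct[f] / totals[f] for f in TASK_FAMILIES if totals[f]}


# Made-up toy records, purely to show the call shape.
toy_records = [
    ("multimodal grounding", "biology", True),
    ("multimodal grounding", "physics", False),
    ("critical assessment", "medicine", False),
    ("critical assessment", "economics", False),
]
toy_scores = per_family_accuracy(toy_records)
print(toy_scores)
```

Reporting one score per family, rather than a single pooled number, is what lets a gap in one ability (for example critique) stay visible next to stronger results elsewhere.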

Problem

Research questions and friction points this paper is trying to address.

scientific paper understanding

multimodal LLMs

integrated reasoning

benchmarking

agent-oriented reasoning

Innovation

Methods, ideas, or system contributions that make the work stand out.

agentic reasoning

multimodal LLMs

scientific paper understanding