🤖 AI Summary
This work addresses the inherent trade-off among scale, faithfulness, and realism in scientific multimodal document reasoning datasets. To overcome this challenge, the authors propose the Synthesize-and-Reground framework, which constructs SciMDR, a large-scale, high-fidelity dataset for scientific multimodal reasoning, through a two-stage process: first synthesizing claim-centric question-answer pairs from focused document segments, then programmatically re-embedding them into full scientific documents. This approach is the first to systematically re-anchor locally faithful QA pairs within complete scientific papers, effectively balancing dataset scale, factual fidelity, and task complexity. The study also introduces SciMDR-Eval, an expert-annotated evaluation benchmark. Models fine-tuned on SciMDR demonstrate substantial performance gains across multiple scientific QA benchmarks, particularly on tasks requiring complex document-level reasoning.
📝 Abstract
Constructing scientific multimodal document reasoning datasets for foundation model training involves an inherent trade-off among scale, faithfulness, and realism. To address this challenge, we introduce the synthesize-and-reground framework, a two-stage pipeline comprising: (1) Claim-Centric QA Synthesis, which generates faithful, isolated QA pairs with reasoning from focused document segments, and (2) Document-Scale Regrounding, which programmatically re-embeds these pairs into full-document tasks to ensure realistic complexity. Using this framework, we construct SciMDR, a large-scale training dataset for cross-modal comprehension, comprising 300K QA pairs with explicit reasoning chains across 20K scientific papers. We further construct SciMDR-Eval, an expert-annotated benchmark to evaluate multimodal comprehension within full-length scientific workflows. Experiments demonstrate that models fine-tuned on SciMDR achieve significant improvements across multiple scientific QA benchmarks, particularly on tasks requiring complex document-level reasoning.
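The abstract does not describe the pipeline at the code level, but the two-stage idea can be sketched as follows. This is a minimal illustration under stated assumptions: in the real framework, Stage 1 would prompt a model on each focused segment, whereas here QA synthesis is stubbed out, and all function and field names (`synthesize_claim_centric_qa`, `reground_to_document`, `evidence_segment`, etc.) are hypothetical, not the authors' API.

```python
from dataclasses import dataclass

@dataclass
class QAPair:
    question: str
    answer: str
    reasoning: str
    segment_id: int  # the focused segment the pair was synthesized from

def synthesize_claim_centric_qa(segments):
    """Stage 1 (hypothetical stub): derive one faithful QA pair per focused
    segment. A real pipeline would generate these with an LLM conditioned on
    the isolated segment, keeping the pair locally grounded."""
    pairs = []
    for i, seg in enumerate(segments):
        pairs.append(QAPair(
            question=f"What does the claim in segment {i} state?",
            answer=seg,
            reasoning=f"The answer is stated directly in segment {i}.",
            segment_id=i,
        ))
    return pairs

def reground_to_document(pairs, document_segments):
    """Stage 2 (hypothetical stub): re-embed each isolated pair into a
    full-document task, so answering now requires locating the evidence
    among the paper's other segments."""
    full_doc = "\n\n".join(document_segments)
    tasks = []
    for p in pairs:
        tasks.append({
            "context": full_doc,               # entire paper, not one segment
            "question": p.question,
            "answer": p.answer,
            "reasoning": p.reasoning,
            "evidence_segment": p.segment_id,  # provenance kept for supervision
        })
    return tasks
```

The key design point the sketch captures is that faithfulness comes from Stage 1 (each pair is tied to a single segment) while realistic complexity comes from Stage 2 (the same pair is served with the full document as context, with the evidence location retained as supervision).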