SCI-Reason: A Dataset with Chain-of-Thought Rationales for Complex Multimodal Reasoning in Academic Areas

📅 2025-04-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing large multimodal models (LMMs) lack systematic evaluation of multi-step reasoning capabilities on academically complex images. Method: We introduce SCI-Reason—the first benchmark for structured, domain-specific multimodal reasoning—comprising 12,066 real-world scientific images from PubMed and 12,626 corresponding chain-of-thought (CoT)-annotated question-answer pairs. Crucially, it employs structured, multi-step vision-language reasoning chains as supervisory signals. Contribution/Results: Our evaluation of eight state-of-the-art LLMs/LMMs reveals that more than half of model failures stem from reasoning-chain breakdown rather than insufficient visual feature extraction—a finding previously unreported. The best-performing model, Claude-3.7-Sonnet, achieves only 55.19% accuracy, underscoring the task's difficulty. Empirically, fine-tuning on SCI-Reason improves both reasoning fidelity and cross-domain visual question answering (VQA) generalization of open-source models, establishing a new foundation for advancing scientific multimodal reasoning.

📝 Abstract
Large Language Models (LLMs) and Large Multimodal Models (LMMs) demonstrate impressive problem-solving skills across many tasks and domains. However, their ability to reason with complex images in academic domains has not been systematically investigated. To bridge this gap, we present SCI-Reason, a dataset for complex multimodal reasoning in academic areas. SCI-Reason aims to test and improve the reasoning ability of large multimodal models using real complex images from academic domains. The dataset contains 12,066 images and 12,626 question-answer pairs extracted from PubMed, divided into training, validation, and test splits. Each question-answer pair also contains an accurate and efficient inference chain as a guide to improving the inference properties of the dataset. With SCI-Reason, we performed a comprehensive evaluation of 8 well-known models. The best-performing model, Claude-3.7-Sonnet, achieved an accuracy of only 55.19%. Error analysis shows that more than half of the model failures are due to breakdowns in multi-step inference chains rather than errors in primary visual feature extraction. This finding underscores the inherent limitations in the reasoning capabilities of current multimodal models when processing complex image-analysis tasks within authentic academic contexts. Experiments on open-source models show that fine-tuning on SCI-Reason not only enhances reasoning ability but also improves cross-domain generalization on VQA tasks. We also explore future applications of model inference capabilities in this domain, highlighting its potential for future research.
Problem

Research questions and friction points this paper is trying to address.

Assessing multimodal models' reasoning over complex academic images
Providing inference chains to guide accurate academic visual question answering
Addressing limitations in multi-step reasoning for scientific image analysis
Innovation

Methods, ideas, or system contributions that make the work stand out.

SCI-Reason, a dataset for complex multimodal academic reasoning
Contains 12,066 images and 12,626 QA pairs with CoT inference chains
Fine-tuning on it enhances reasoning and cross-domain VQA generalization
Chenghao Ma — Beijing University of Posts and Telecommunications
E. Haihong — Beijing University of Posts and Telecommunications
Junpeng Ding — Beijing University of Posts and Telecommunications
Jun Zhang — Beijing University of Posts and Telecommunications
Ziyan Ma — Beijing University of Posts and Telecommunications
Huang Qing — Beijing University of Posts and Telecommunications
Bofei Gao — Peking University (Natural Language Processing)
Liang Chen — Peking University
Meina Song — Professor of Computer Science, Beijing University of Posts and Telecommunications (data science)