🤖 AI Summary
Existing large multimodal models (LMMs) lack systematic evaluation of multi-step reasoning capabilities on academically complex images.
Method: We introduce SCI-Reason—the first benchmark for structured, domain-specific multimodal reasoning—comprising 12,066 real-world scientific images from PubMed paired with 12,626 chain-of-thought (CoT)-annotated question-answer pairs. Crucially, it employs structured, multi-step vision-language reasoning chains as supervisory signals.
Contribution/Results: Our evaluation of eight state-of-the-art LLMs/LMMs reveals that the primary failure mode is reasoning-chain breakdown rather than insufficient visual feature extraction—a finding previously unreported. The best-performing model achieves only 55.19% accuracy, underscoring the task's difficulty. Empirically, fine-tuning on SCI-Reason significantly improves both reasoning fidelity and cross-domain visual question answering (VQA) generalization in open-source models, establishing a foundation for advancing scientific multimodal reasoning.
📝 Abstract
Large Language Models (LLMs) and Large Multimodal Models (LMMs) demonstrate impressive problem-solving skills across many tasks and domains. However, their ability to reason over complex images in academic domains has not been systematically investigated. To bridge this gap, we present SCI-Reason, a dataset for complex multimodal reasoning in academic domains. SCI-Reason aims to test and improve the reasoning ability of large multimodal models on real, complex images from academic sources. The dataset contains 12,066 images and 12,626 question-answer pairs extracted from PubMed, divided into training, validation, and test splits. Each question-answer pair also includes an accurate and efficient inference chain that serves as a guide for improving models' reasoning. Using SCI-Reason, we conducted a comprehensive evaluation of eight well-known models. The best-performing model, Claude-3.7-Sonnet, achieved an accuracy of only 55.19%. Error analysis shows that more than half of the model failures stem from breakdowns in multi-step inference chains rather than from errors in primary visual feature extraction. This finding underscores the inherent limitations of current multimodal models' reasoning capabilities when processing complex image-analysis tasks in authentic academic contexts. Experiments on open-source models show that fine-tuning on SCI-Reason not only enhances reasoning ability but also yields cross-domain generalization on VQA tasks. We also explore applications of model inference capabilities in this domain, highlighting their potential for future research.