Unveiling Fine-Grained Visual Traces: Evaluating Multimodal Interleaved Reasoning Chains in Multimodal STEM Tasks

📅 2026-04-21
📈 Citations: 0
✨ Influential: 0

🤖 AI Summary
Existing benchmarks for multimodal large language models struggle to evaluate cross-modal reasoning in STEM tasks: modality redundancy permits unimodal shortcuts, and scoring usually ignores intermediate reasoning steps. To address this, the work proposes StepSTEM, a graduate-level multimodal reasoning benchmark of 283 problems spanning mathematics, physics, chemistry, biology, and engineering. It introduces a problem-construction mechanism that enforces complementary use of text and images, along with a step-level evaluation framework that supports multiple reference solutions. The framework uses dynamic programming to align predicted reasoning chains with those references, so it assesses interleaved textual and visual reasoning rather than final answers alone. Experiments show that even state-of-the-art models such as Gemini 3.1 Pro and Claude Opus 4.6 achieve only 38.29% accuracy, underscoring how much headroom remains in cross-modal STEM reasoning.

πŸ“ Abstract
Multimodal large language models (MLLMs) have shown promising reasoning abilities, yet evaluating their performance in specialized domains remains challenging. STEM reasoning is a particularly valuable testbed because it provides highly verifiable feedback, but existing benchmarks often permit unimodal shortcuts due to modality redundancy and focus mainly on final-answer accuracy, overlooking the reasoning process itself. To address this challenge, we introduce StepSTEM: a graduate-level benchmark of 283 problems across mathematics, physics, chemistry, biology, and engineering for fine-grained evaluation of cross-modal reasoning in MLLMs. StepSTEM is constructed through a rigorous curation pipeline that enforces strict complementarity between textual and visual inputs. We further propose a general step-level evaluation framework for both text-only chain-of-thought and interleaved image-text reasoning, using dynamic programming to align predicted reasoning steps with multiple reference solutions. Experiments across a wide range of models show that current MLLMs still rely heavily on textual reasoning, with even Gemini 3.1 Pro and Claude Opus 4.6 achieving only 38.29% accuracy. These results highlight substantial headroom for genuine cross-modal STEM reasoning and position StepSTEM as a benchmark for fine-grained evaluation of multimodal reasoning. Source code is available at https://github.com/lll-hhh/STEPSTEM.
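
The step-level alignment mentioned above can be illustrated with a minimal sketch. This is not the StepSTEM implementation: the abstract says only that dynamic programming aligns predicted steps with multiple reference solutions. Everything else here is an assumption made for illustration, including the function names (`jaccard`, `align_score`, `step_score`), the token-overlap similarity, the 0.5 match threshold, the normalization by reference length, and the restriction to text steps (the actual framework also handles interleaved image steps).

```python
from typing import Callable, Sequence


def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity in [0, 1]; a hypothetical stand-in for a real step matcher."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def align_score(
    pred: Sequence[str],
    ref: Sequence[str],
    sim: Callable[[str, str], float] = jaccard,
    threshold: float = 0.5,  # assumed cutoff below which two steps do not match
) -> float:
    """Order-preserving alignment of predicted steps to one reference solution.

    dp[i][j] holds the best total similarity achievable using the first i
    predicted steps and the first j reference steps. Each step is matched at
    most once, and matches may not cross.
    """
    n, m = len(pred), len(ref)
    if m == 0:
        return 0.0
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = sim(pred[i - 1], ref[j - 1])
            matched = dp[i - 1][j - 1] + (s if s >= threshold else 0.0)
            # Either skip a predicted step, skip a reference step, or match the pair.
            dp[i][j] = max(dp[i - 1][j], dp[i][j - 1], matched)
    # Assumed normalization: fraction of the reference chain that was covered.
    return dp[n][m] / m


def step_score(pred: Sequence[str], references: Sequence[Sequence[str]]) -> float:
    """Score against several reference solutions and keep the best alignment."""
    return max((align_score(pred, ref) for ref in references), default=0.0)


if __name__ == "__main__":
    predicted = [
        "compute the force from the diagram",
        "apply newton second law",
        "solve for acceleration",
    ]
    references = [
        ["apply newton second law", "solve for acceleration"],
        ["use energy conservation", "solve for velocity"],
    ]
    print(step_score(predicted, references))
```

Taking the maximum over references means a prediction is not penalized for following a valid solution path that differs from the first reference. The no-crossing constraint in the recurrence is what makes the score sensitive to the order of reasoning steps, not only to their presence.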
Problem

Research questions and friction points this paper is trying to address.

multimodal reasoning
STEM tasks
reasoning evaluation
modality redundancy
fine-grained evaluation
Innovation

Methods, ideas, or system contributions that make the work stand out.

multimodal reasoning
fine-grained evaluation
interleaved image-text reasoning
step-level alignment
modality complementarity