What MLLMs Learn about When They Learn about Multimodal Reasoning: Perception, Reasoning, or their Integration?

📅 2025-10-02
🤖 AI Summary
Existing evaluations of multimodal reasoning models rely heavily on overall accuracy, a single score that fails to disentangle progress across the constituent capabilities of perception, reasoning, and integration. Method: We introduce MathLens, the first benchmark to explicitly decouple geometric reasoning into these three independently evaluable sub-capabilities. It builds on fine-grained annotations: symbolic diagram generation, controlled multimodal question design, and perception probes. Contribution/Results: Comparing supervised fine-tuning (SFT) and reinforcement learning (RL), we find that RL substantially improves perceptual consistency and integration, whereas multimodal SFT often degrades robustness. Crucially, reasoning gains remain constrained by perceptual foundations, and integration persists as the primary bottleneck. MathLens establishes a paradigm for fine-grained attribution of multimodal reasoning abilities and provides an open, principled benchmarking toolkit.

📝 Abstract
Multimodal reasoning models have recently shown promise on challenging domains such as olympiad-level geometry, yet their evaluation remains dominated by aggregate accuracy, a single score that obscures where and how models are improving. We introduce MathLens, a benchmark designed to disentangle the subskills of multimodal reasoning while preserving the complexity of textbook-style geometry problems. The benchmark separates performance into three components: Perception (extracting information from raw inputs), Reasoning (operating on available information), and Integration (selecting relevant perceptual evidence and applying it within reasoning). To support each test, we provide annotations: visual diagrams, textual descriptions to evaluate reasoning in isolation, controlled questions that require both modalities, and probes for fine-grained perceptual skills, all derived from symbolic specifications of the problems to ensure consistency and robustness. Our analysis reveals that different training approaches have uneven effects: First, reinforcement learning chiefly strengthens perception, especially when supported by textual supervision, while textual SFT indirectly improves perception through reflective reasoning. Second, reasoning improves only in tandem with perception. Third, integration remains the weakest capacity, with residual errors concentrated there once other skills advance. Finally, robustness diverges: RL improves consistency under diagram variation, whereas multimodal SFT reduces it through overfitting. We will release all data and experimental logs.
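The three-way decomposition above can be sketched as a simple scoring routine. This is a minimal illustration, not the paper's protocol: the `results` field names and the independence-based `integration_gap` metric (perception accuracy times reasoning accuracy, minus observed full-task accuracy) are assumptions made here for concreteness.

```python
# Hypothetical sketch of MathLens-style subskill attribution.
# Field names ("skill", "correct") and the gap metric are illustrative assumptions.
from collections import defaultdict

def subskill_accuracy(results):
    """Compute per-subskill accuracy from a list of probe outcomes,
    e.g. {"skill": "perception", "correct": True}."""
    totals, hits = defaultdict(int), defaultdict(int)
    for r in results:
        totals[r["skill"]] += 1
        hits[r["skill"]] += int(r["correct"])
    return {s: hits[s] / totals[s] for s in totals}

def integration_gap(acc):
    """How far full-task (integration) accuracy lags behind what
    perception and reasoning alone would predict if combining them
    were free (a naive independence assumption)."""
    return acc["perception"] * acc["reasoning"] - acc["integration"]

# Toy outcomes: perfect perception, shaky reasoning, failing integration.
results = [
    {"skill": "perception", "correct": True},
    {"skill": "perception", "correct": True},
    {"skill": "reasoning", "correct": True},
    {"skill": "reasoning", "correct": False},
    {"skill": "integration", "correct": False},
    {"skill": "integration", "correct": False},
]
acc = subskill_accuracy(results)
print(acc)                   # per-skill accuracy, e.g. perception 1.0
print(integration_gap(acc))  # positive gap: integration is the bottleneck
```

A positive gap mirrors the paper's finding that residual errors concentrate in integration once perception and reasoning improve.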
Problem

Research questions and friction points this paper is trying to address.

Disentangling subskills of multimodal reasoning in geometry problems
Evaluating perception, reasoning, and integration capabilities separately
Analyzing uneven effects of different training approaches on skills
Innovation

Methods, ideas, or system contributions that make the work stand out.

Developed MathLens benchmark for multimodal reasoning evaluation
Separated performance into perception, reasoning, and integration components
Analyzed training effects on perception, reasoning, and integration skills