🤖 AI Summary
Multimodal large language models (MLLMs) struggle with structured mathematical reasoning and are prone to over-reasoning. Method: This paper proposes the Self-structured Chain of Thought (SCoT) paradigm, a "slow-thinking" mechanism that performs atomic-level semantic decomposition and dynamic recomposition of reasoning steps. Built on SCoT, the authors develop AtomThink, a framework with four core components: an atomized data engine, serialized supervised fine-tuning, policy-guided multi-round inference, and an atomic capability metric. The approach combines MLLM fine-tuning, cognitively grounded reasoning-path generation, and interpretable evaluation. Results: AtomThink yields more than a 10% average accuracy gain on MathVista and MathVerse. Compared with state-of-the-art structured CoT methods, it improves data utilization by 5× and inference efficiency by 85.3%, while enhancing reasoning controllability and generalization.
📝 Abstract
In this paper, we address the challenging task of multimodal mathematical reasoning by incorporating the ability of "slow thinking" into multimodal large language models (MLLMs). Our core idea is that reasoning abilities at different levels can be combined dynamically to tackle questions of varying complexity. To this end, we propose the paradigm of Self-structured Chain of Thought (SCoT), which is composed of minimal semantic atomic steps. Unlike existing methods that rely on structured templates or free-form paradigms, our method not only generates cognitive CoT structures for various complex tasks but also mitigates the phenomenon of overthinking. To introduce structured reasoning capabilities into visual understanding models, we further design a novel AtomThink framework with four key modules: (i) a data engine that generates high-quality multimodal reasoning paths; (ii) a supervised fine-tuning process with serialized inference data; (iii) a policy-guided multi-turn inference method; and (iv) an atomic capability metric that evaluates the single-step utilization rate. Extensive experiments show that AtomThink significantly improves the performance of baseline MLLMs, achieving more than 10% average accuracy gains on MathVista and MathVerse. Compared to state-of-the-art structured CoT approaches, our method not only achieves higher accuracy but also improves data utilization by 5× and boosts inference efficiency by 85.3%. Our code is publicly available at https://github.com/Quinn777/AtomThink.
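The abstract does not spell out the mechanics of module (iii), but the policy-guided multi-turn inference it names can be sketched as a generic loop: each round, a model proposes candidate atomic steps, a policy scores them, the best step is appended to the chain, and the loop ends when a terminal answer step appears. Everything below is an illustrative stand-in, not the authors' actual implementation: the names `atomthink_inference`, `propose_steps`, and `score_step` are hypothetical, and the demo uses hand-written proposers and scorers instead of an MLLM.

```python
from typing import Callable, List

def atomthink_inference(
    question: str,
    propose_steps: Callable[[str, List[str]], List[str]],  # stand-in for an MLLM proposing atomic steps
    score_step: Callable[[List[str], str], float],         # stand-in for a policy scoring each candidate
    max_rounds: int = 8,
) -> List[str]:
    """Greedy multi-round reasoning: append one policy-selected atomic step per round."""
    chain: List[str] = []
    for _ in range(max_rounds):
        candidates = propose_steps(question, chain)
        if not candidates:
            break
        best = max(candidates, key=lambda s: score_step(chain, s))
        chain.append(best)
        if best.startswith("ANSWER:"):  # terminal atomic step ends the reasoning
            break
    return chain

# Toy demo: a scripted proposer and a keyword-based scorer (no model involved).
def demo_propose(question: str, chain: List[str]) -> List[str]:
    script = [
        ["Step 1: parse the figure", "Step 1: guess the answer"],
        ["Step 2: set up the equation", "Step 2: restate the question"],
        ["ANSWER: 42", "Step 3: digress"],
    ]
    return script[len(chain)] if len(chain) < len(script) else []

def demo_score(chain: List[str], step: str) -> float:
    # Prefer substantive steps and the terminal answer over filler.
    good = "equation" in step or "parse" in step or step.startswith("ANSWER:")
    return 1.0 if good else 0.0

chain = atomthink_inference("What is x?", demo_propose, demo_score)
```

Because only one step is committed per round, early stopping at the `ANSWER:` step is what gives this scheme its guard against overthinking: the chain grows only as long as the policy keeps selecting productive atomic steps.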