🤖 AI Summary
Multimodal large language models (MLLMs) exhibit limited capability on complex mathematical reasoning problems because they lack fine-grained, stepwise reasoning over multimodal inputs.
Method: This paper introduces the "slow-thinking" paradigm, integrating long-chain, atomic-level reasoning into MLLMs via AtomThink—a novel atomic thinking framework comprising (i) an automatic Chain-of-Thought (CoT) annotation engine, (ii) atomic-step fine-tuning, and (iii) four search strategies guided by a policy reward model (PRM). The approach unifies vision–math joint fine-tuning, reward-guided search, and interpretable CoT generation.
Contribution/Results: We release AtomMATH, a large-scale multimodal mathematical dataset, and propose fine-grained atomic capability evaluation metrics. On MathVista and MathVerse benchmarks, our method achieves relative accuracy improvements of ~50% and ~120%, respectively, significantly enhancing MLLMs’ hierarchical, adaptive reasoning on complex mathematical problems.
📝 Abstract
In this paper, we address the challenging task of multimodal mathematical reasoning by incorporating the ability of "slow thinking" into multimodal large language models (MLLMs). Contrary to existing methods that rely on direct or fast thinking, our key idea is to construct long chains of thought (CoT) consisting of atomic actions in a step-by-step manner, guiding MLLMs to perform complex reasoning. To this end, we design a novel AtomThink framework composed of three key modules: (i) a CoT annotation engine that automatically generates high-quality CoT annotations to address the lack of high-quality visual mathematical data; (ii) an atomic step fine-tuning strategy that jointly optimizes an MLLM and a policy reward model (PRM) for step-wise reasoning; and (iii) four different search strategies that can be applied with the PRM to complete reasoning. Additionally, we propose AtomMATH, a large-scale multimodal dataset of long CoTs, and an atomic capability evaluation metric for mathematical tasks. Extensive experimental results show that the proposed AtomThink significantly improves the performance of baseline MLLMs, achieving approximately 50% relative accuracy gains on MathVista and 120% on MathVerse. To support the advancement of multimodal slow-thinking models, we will make our code and dataset publicly available at https://github.com/Quinn777/AtomThink.
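To make the PRM-guided search idea concrete, here is a minimal sketch of step-level beam search over atomic reasoning steps. This is not the paper's implementation: `step_candidates` and `prm_score` are toy stand-ins for, respectively, the MLLM's next-step proposer and the trained policy reward model; the paper's four strategies and scoring details differ.

```python
# Minimal sketch of PRM-guided step-level beam search (assumptions:
# `step_candidates` and `prm_score` are hypothetical stand-ins for the
# MLLM step generator and the policy reward model in AtomThink).
from typing import List


def prm_score(steps: List[str]) -> float:
    # Toy reward: prefer longer, more detailed partial chains.
    # A real PRM would score the reasoning quality of each partial CoT.
    return float(sum(len(s) for s in steps))


def step_candidates(steps: List[str]) -> List[str]:
    # Toy generator: a real MLLM would propose candidate atomic steps
    # conditioned on the image, question, and the chain so far.
    n = len(steps) + 1
    return [f"step{n}: short guess", f"step{n}: detailed atomic derivation"]


def beam_search(beam_width: int = 2, max_steps: int = 3) -> List[str]:
    """Keep the beam_width partial chains with the highest PRM score,
    expanding one atomic step at a time until max_steps is reached."""
    beams: List[List[str]] = [[]]
    for _ in range(max_steps):
        expanded = [chain + [cand] for chain in beams
                    for cand in step_candidates(chain)]
        expanded.sort(key=prm_score, reverse=True)
        beams = expanded[:beam_width]
    return beams[0]  # best-scoring complete chain
```

Swapping the expansion and pruning rules in this loop yields the usual family of strategies (greedy with `beam_width=1`, best-of-N by sampling full chains and ranking them once, etc.).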