🤖 AI Summary
Current multimodal large language models (MLLMs) rely solely on static image inputs for mathematical reasoning, lacking the human-like abilities of dynamic visual observation and stepwise verification. To address this, we propose Reason Chunking—a mechanism that decomposes reasoning into cognitively grounded logical units termed Critical Reasoning Units (CRUs)—enabling synergistic modeling of dynamic visual perception and proposition-level incremental verification. Our contributions are threefold: (1) the CRU design, structured according to Miller's Law to align with human cognitive capacity; (2) CRUX, the first dataset featuring explicit multi-path CRU annotations; and (3) a three-stage progressive training paradigm comprising instructional and practice supervised fine-tuning (SFT), followed by strategic reinforcement learning (RL). The resulting ViRC-7B model achieves an average 18.8% improvement across multiple mathematical reasoning benchmarks, substantially outperforming existing MLLM baselines. Code is publicly available.
📝 Abstract
Chain-of-Thought (CoT) prompting has significantly enhanced the reasoning ability of LLMs, yet it faces challenges when extended to multimodal domains, particularly in mathematical tasks. Existing MLLMs typically perform textual reasoning from a single static mathematical image, overlooking dynamic visual acquisition during reasoning. In contrast, humans repeatedly examine the visual image and employ step-by-step reasoning to prove intermediate propositions. This strategy of decomposing the problem-solving process into key logical nodes adheres to Miller's Law in cognitive science. Inspired by this insight, we propose the ViRC framework for multimodal mathematical tasks, introducing a Reason Chunking mechanism that structures multimodal mathematical CoT into consecutive Critical Reasoning Units (CRUs) to simulate the problem-solving patterns of human experts. CRUs ensure intra-unit textual coherence for intermediate proposition verification while integrating visual information across units to generate subsequent propositions and support structured reasoning. To this end, we present the CRUX dataset, which uses three visual tools and four reasoning patterns to provide explicitly annotated CRUs across multiple reasoning paths for each mathematical problem. Leveraging CRUX, we propose a progressive training strategy inspired by human cognitive learning, comprising Instructional SFT, Practice SFT, and Strategic RL, aimed at further strengthening the model's Reason Chunking ability. The resulting ViRC-7B model achieves an 18.8% average improvement over baselines across multiple mathematical benchmarks. Code is available at https://github.com/Leon-LihongWang/ViRC.