ViRC: Enhancing Visual Interleaved Mathematical CoT with Reason Chunking

📅 2025-12-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current multimodal large language models (MLLMs) rely solely on static image inputs for mathematical reasoning, lacking the human-like abilities of dynamic visual observation and stepwise verification. To address this, the authors propose Reason Chunking, a mechanism that decomposes reasoning into cognitively grounded logical units termed Critical Reasoning Units (CRUs), enabling synergistic modeling of dynamic visual perception and proposition-level incremental verification. The contributions are threefold: (1) the CRU design, structured according to Miller's Law to align with human cognitive capacity; (2) CRUX, the first dataset featuring explicit multi-path CRU annotations; and (3) a three-stage progressive training paradigm comprising instructional and practice supervised fine-tuning (SFT), followed by strategic reinforcement learning (RL). The resulting ViRC-7B model achieves an average 18.8% improvement across multiple mathematical reasoning benchmarks, substantially outperforming existing MLLM baselines. Code is publicly available.

📝 Abstract
CoT has significantly enhanced the reasoning ability of LLMs, but it faces challenges when extended to multimodal domains, particularly in mathematical tasks. Existing MLLMs typically perform textual reasoning from a single static mathematical image, overlooking dynamic visual acquisition during reasoning. In contrast, humans repeatedly examine the visual image and employ step-by-step reasoning to prove intermediate propositions. This strategy of decomposing the problem-solving process into key logical nodes adheres to Miller's Law in cognitive science. Inspired by this insight, we propose the ViRC framework for multimodal mathematical tasks, introducing a Reason Chunking mechanism that structures multimodal mathematical CoT into consecutive Critical Reasoning Units (CRUs) to simulate human expert problem-solving patterns. CRUs ensure intra-unit textual coherence for intermediate proposition verification while integrating visual information across units to generate subsequent propositions and support structured reasoning. To this end, we present the CRUX dataset, which uses three visual tools and four reasoning patterns to provide explicitly annotated CRUs across multiple reasoning paths for each mathematical problem. Leveraging the CRUX dataset, we propose a progressive training strategy inspired by human cognitive learning, comprising Instructional SFT, Practice SFT, and Strategic RL, aimed at further strengthening the model's Reason Chunking ability. The resulting ViRC-7B model achieves an 18.8% average improvement over baselines across multiple mathematical benchmarks. Code is available at https://github.com/Leon-LihongWang/ViRC.
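The chunking idea in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation (which operates on model reasoning traces and visual tool calls); it only shows the core notion of grouping a flat chain-of-thought into consecutive units whose size respects Miller's Law (roughly 7 ± 2 items in working memory). The function name, the step representation, and the fixed limit of 7 are all illustrative assumptions.

```python
# Illustrative sketch only: groups a flat list of reasoning steps into
# Critical Reasoning Units (CRUs) of bounded size, following Miller's Law.
# The paper's actual mechanism additionally verifies an intermediate
# proposition at each unit boundary and re-consults the image between units.

MILLER_LIMIT = 7  # assumed per-unit cap (7 +/- 2 items), per Miller's Law


def chunk_reasoning(steps, limit=MILLER_LIMIT):
    """Split `steps` into consecutive CRUs of at most `limit` steps each."""
    if limit < 1:
        raise ValueError("CRU size limit must be positive")
    return [steps[i:i + limit] for i in range(0, len(steps), limit)]


# Example: a 10-step derivation becomes two CRUs of sizes 7 and 3.
steps = [f"step {k}" for k in range(1, 11)]
crus = chunk_reasoning(steps)
print([len(cru) for cru in crus])  # -> [7, 3]
```

In the paper, unit boundaries are learned and annotated (via the CRUX dataset) rather than fixed by position, but the bounded-size grouping above is the structural constraint that Reason Chunking imposes on the CoT.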
Problem

Research questions and friction points this paper is trying to address.

Enhances multimodal mathematical reasoning with visual chunking
Simulates human step-by-step problem-solving in visual math tasks
Improves model performance on mathematical benchmarks through structured training
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces Reason Chunking for structured multimodal reasoning
Uses Critical Reasoning Units to simulate human problem-solving
Employs progressive training strategy to enhance chunking ability
Lihong Wang, Jilin University
Liangqi Li, Ant Group
Weiwei Feng, Ant Group
Jiamin Wu, The Chinese University of Hong Kong
Changtao Miao, University of Science and Technology of China
Tieru Wu, Jilin University
Rui Ma, Jilin University
Bo Zhang, Ant Group
Zhe Li, Ant Group