🤖 AI Summary
Large language models (LLMs) underperform in mathematical domains such as geometry that intrinsically rely on visual aids, largely because they lack autonomous, high-fidelity visual reasoning. To address this, the authors propose MathCanvas, a framework that endows unified large multimodal models with intrinsic *visual chain-of-thought* (VCoT) reasoning: the model itself decides *when* to generate a visualization, *what* to draw, and *how* to edit it, yielding tight text-image co-reasoning. Training proceeds in two stages: (1) a Visual Manipulation stage that pre-trains on a 15.2M-pair corpus of caption-to-diagram pairs (MathCanvas-Imagen) and step-by-step editing trajectories (MathCanvas-Edit), followed by (2) a Strategic Visual-Aided Reasoning stage that fine-tunes on 219K interleaved visual-textual reasoning paths (MathCanvas-Instruct). The resulting model, BAGEL-Canvas, achieves an 86% relative improvement over strong LMM baselines on the newly constructed MathCanvas-Bench and generalizes well to other public mathematical reasoning benchmarks, advancing intrinsic visual reasoning in foundation models.
📝 Abstract
While Large Language Models (LLMs) have excelled in textual reasoning, they struggle with mathematical domains like geometry that intrinsically rely on visual aids. Existing approaches to Visual Chain-of-Thought (VCoT) are often limited by rigid external tools or fail to generate the high-fidelity, strategically timed diagrams necessary for complex problem-solving. To bridge this gap, we introduce MathCanvas, a comprehensive framework designed to endow unified Large Multimodal Models (LMMs) with intrinsic VCoT capabilities for mathematics. Our approach consists of two phases. First, a Visual Manipulation stage pre-trains the model on a novel 15.2M-pair corpus, comprising 10M caption-to-diagram pairs (MathCanvas-Imagen) and 5.2M step-by-step editing trajectories (MathCanvas-Edit), to master diagram generation and editing. Second, a Strategic Visual-Aided Reasoning stage fine-tunes the model on MathCanvas-Instruct, a new 219K-example dataset of interleaved visual-textual reasoning paths, teaching it when and how to leverage visual aids. To facilitate rigorous evaluation, we introduce MathCanvas-Bench, a challenging benchmark with 3K problems that require models to produce interleaved visual-textual solutions. Our model, BAGEL-Canvas, trained under this framework, achieves an 86% relative improvement over strong LMM baselines on MathCanvas-Bench and demonstrates excellent generalization to other public math benchmarks. Our work provides a complete toolkit (framework, datasets, and benchmark) to unlock complex, human-like visual-aided reasoning in LMMs. Project Page: https://mathcanvas.github.io/
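To make the "intrinsic VCoT" idea concrete, the loop below is a minimal toy sketch of interleaved visual-textual reasoning: at each step a policy decides whether to emit a text step or to generate/edit a diagram, and both step types accumulate in one interleaved trace. All names (`vcot_solve`, `CanvasState`, the hard-coded policy) are illustrative assumptions, not the MathCanvas API; the real model learns this decision from MathCanvas-Instruct rather than following a fixed rule.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class CanvasState:
    # Interleaved solution trace: ("text", ...) and ("image", ...) steps.
    trace: List[Tuple[str, str]] = field(default_factory=list)
    diagram: Optional[str] = None  # latest diagram (placeholder string here)

def vcot_solve(problem: str, max_steps: int = 4) -> CanvasState:
    """Toy interleaved reasoning loop: draw first, then reason over the drawing."""
    state = CanvasState()
    for step in range(max_steps):
        # Stand-in for the learned policy deciding *when* to draw:
        # here, fixed to draw once at the start, then continue in text.
        action = "draw" if step == 0 else "text"
        if action == "draw":
            # *What* to draw / *how* to edit: just a placeholder render.
            state.diagram = f"diagram(v{step}: {problem})"
            state.trace.append(("image", state.diagram))
        else:
            state.trace.append(("text", f"step {step}: reason over {state.diagram}"))
    return state

result = vcot_solve("triangle ABC with AB = AC")
for kind, content in result.trace:
    print(kind, "->", content)
```

In the actual framework the draw/edit decision, the diagram content, and the textual steps all come from one unified LMM; this sketch only shows the interleaved control flow such a model produces.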