How RL Unlocks the Aha Moment in Geometric Interleaved Reasoning

📅 2026-03-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses a limitation of multimodal large language models in geometric reasoning tasks: supervised fine-tuning (SFT) merely imitates output formats without establishing a causal dependency between diagram generation and logical reasoning, which constrains performance. The authors propose Faire, a framework that explains the failure mechanism of SFT in such settings and introduces a reinforcement learning–based functional alignment paradigm. By enforcing three causality-inspired constraints, Faire shifts the model’s behavior from superficial imitation toward deep integration of drawing and reasoning. Extensive experiments show that this approach achieves competitive performance on challenging geometric reasoning benchmarks and that the generated diagrams genuinely support the reasoning process.

📝 Abstract
Solving complex geometric problems inherently requires interleaved reasoning: a tight alternation between constructing diagrams and performing logical deductions. Although recent Multimodal Large Language Models (MLLMs) have demonstrated strong capabilities in visual generation and plotting, we identify a counter-intuitive and underexplored phenomenon. Naively applying Supervised Fine-Tuning (SFT) on interleaved plot-solution data leads to a substantial degradation in reasoning performance compared to text-only baselines. We argue that this failure stems from a fundamental limitation of SFT, which primarily induces distributional alignment: the model learns to reproduce the surface format of interleaved plotting but fails to internalize the causal dependency between the generated plot and reasoning steps. To overcome this limitation, we propose Faire (Functional alignment for interleaved reasoning), a reinforcement learning framework that enforces three causal constraints to move beyond superficial imitation toward functional alignment. Extensive experiments show that Faire induces a qualitative shift in model behavior in which the plotting is effectively internalized, yielding competitive performance on challenging geometric reasoning benchmarks.
Problem

Research questions and friction points this paper is trying to address.

interleaved reasoning
geometric reasoning
supervised fine-tuning
multimodal large language models
reinforcement learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

reinforcement learning
interleaved reasoning
functional alignment
geometric reasoning
multimodal LLMs
Xiangxiang Zhang
ByteDance, China
Caijun Jia
ByteDance, China; Shenyang Institute of Computing Technology, Chinese Academy of Sciences, China
Siyuan Li
Zhejiang University & Westlake University (Ph.D. Candidate)
AIGC, Network Architecture, Self-supervised Learning, Optimization
Dingyu He
ByteDance, China
Xiya Xiong
ByteDance, China
Zheng Sun
ByteDance, China; Shenyang Institute of Computing Technology, Chinese Academy of Sciences, China
Honghao He
ByteDance, China; Shenyang Institute of Computing Technology, Chinese Academy of Sciences, China
Yuchen Wu
ByteDance, China
Bihui Yu
Shenyang Institute of Computing Technology, Chinese Academy of Sciences, China
Linzhuang Sun
University of Chinese Academy of Sciences
Multimodal Reasoning
Cheng Tan
Westlake University, China
Jingxuan Wei
University of Chinese Academy of Sciences
Natural Language Processing, Multimodal Learning