🤖 AI Summary
To address the visual hallucinations and insufficient cross-modal fusion caused by pure-text chain-of-thought (CoT) reasoning in vision-language tasks, this paper proposes a vision-anchored, stage-wise reinforcement fine-tuning framework. Our contributions are threefold: (1) the first explicit chain-of-thought annotation paradigm aligned with visual elements; (2) the construction of VisReason, the first large-scale (71K instances) format fine-tuning dataset specifically designed for visual reasoning; and (3) a synergistic paradigm integrating vision-guided format fine-tuning, vision-anchored CoT generation, and reward-based reinforcement fine-tuning. Our method achieves 90.04% accuracy on ChartQA, substantially outperforming reinforcement fine-tuning with pure-text CoT (83.92%), and demonstrates strong cross-domain generalization on benchmarks including CharXiv and PlotQA. These results validate the effectiveness and robustness of our approach in real-world visual document understanding scenarios.
📝 Abstract
Recent advances in large language models have significantly improved textual reasoning through the effective use of Chain-of-Thought (CoT) and reinforcement learning. However, extending these successes to vision-language tasks remains challenging due to inherent limitations of text-only CoT, such as visual hallucinations and insufficient multimodal integration. In this paper, we introduce Point-RFT, a multimodal reasoning framework explicitly designed to leverage visually grounded CoT reasoning for visual document understanding. Our approach consists of two stages. First, we conduct format finetuning on a curated dataset of 71K diverse visual reasoning problems, each annotated with detailed, step-by-step rationales explicitly grounded in corresponding visual elements. Second, we employ reinforcement finetuning targeting visual document understanding. On ChartQA, our approach improves accuracy from 70.88% (format-finetuned baseline) to 90.04%, surpassing the 83.92% achieved by reinforcement finetuning that relies solely on text-based CoT. This result shows that grounded CoT is more effective for multimodal reasoning than text-only CoT. Moreover, Point-RFT exhibits superior generalization across several out-of-domain visual document reasoning benchmarks, including CharXiv, PlotQA, IconQA, and TabMWP, highlighting its potential in complex real-world scenarios.
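The reinforcement finetuning stage described above can be illustrated with a minimal reward sketch. This is an assumption-laden illustration, not the paper's actual implementation: the `<point>...</point>` grounding tag, the `Answer:` response format, and the reward weights are all hypothetical stand-ins for whatever annotation scheme Point-RFT uses, and the numeric check mirrors the relaxed-accuracy convention common on ChartQA (correct within 5% relative tolerance).

```python
import re


def format_reward(response: str) -> float:
    """1.0 if the rationale cites at least one visual anchor.

    The <point>...</point> tag is a hypothetical grounding format
    standing in for the paper's visually grounded CoT annotations.
    """
    return 1.0 if re.search(r"<point>.*?</point>", response, re.DOTALL) else 0.0


def accuracy_reward(response: str, gold: str, tol: float = 0.05) -> float:
    """Relaxed-accuracy check in the style common on ChartQA:
    numeric answers count as correct within 5% relative tolerance,
    string answers must match case-insensitively."""
    m = re.search(r"Answer:\s*(.+)", response)
    if not m:
        return 0.0
    pred = m.group(1).strip()
    try:
        p, g = float(pred), float(gold)
        return 1.0 if abs(p - g) <= tol * max(abs(g), 1e-8) else 0.0
    except ValueError:
        return 1.0 if pred.lower() == gold.strip().lower() else 0.0


def total_reward(response: str, gold: str,
                 w_fmt: float = 0.2, w_acc: float = 0.8) -> float:
    """Weighted sum of grounding-format and answer-accuracy rewards;
    the 0.2/0.8 weighting is an illustrative choice."""
    return w_fmt * format_reward(response) + w_acc * accuracy_reward(response, gold)
```

A policy-gradient trainer would score each sampled rollout with `total_reward`, so responses that both anchor their reasoning to chart regions and reach the correct answer are reinforced over purely textual rationales.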