🤖 AI Summary
To address the visual hallucinations and insufficient cross-modal fusion caused by pure-text chain-of-thought (CoT) reasoning in vision-language tasks, this paper proposes a vision-anchored, stage-wise reinforcement fine-tuning framework. Our contributions are threefold: (1) the first explicit chain-of-thought annotation paradigm aligned with visual elements; (2) the construction of VisReason, the first large-scale (71K instances) format fine-tuning dataset specifically designed for visual reasoning; and (3) a synergistic paradigm integrating vision-guided format fine-tuning, vision-anchored CoT generation, and reward-based reinforcement fine-tuning. Our method achieves 90.04% accuracy on ChartQA, substantially outperforming reinforcement fine-tuning with pure-text CoT (83.92%), and demonstrates strong cross-domain generalization on benchmarks including CharXiv and PlotQA. These results validate the effectiveness and robustness of our approach in real-world visual document understanding scenarios.
📝 Abstract
Recent advances in large language models have significantly improved textual reasoning through the effective use of Chain-of-Thought (CoT) and reinforcement learning. However, extending these successes to vision-language tasks remains challenging due to inherent limitations of text-only CoT, such as visual hallucinations and insufficient multimodal integration. In this paper, we introduce Point-RFT, a multimodal reasoning framework explicitly designed to leverage visually grounded CoT reasoning for visual document understanding. Our approach consists of two stages. First, we conduct format finetuning on a curated dataset of 71K diverse visual reasoning problems, each annotated with detailed, step-by-step rationales explicitly grounded in corresponding visual elements. Second, we employ reinforcement finetuning targeting visual document understanding. On ChartQA, our approach improves accuracy from 70.88% (format-finetuned baseline) to 90.04%, surpassing the 83.92% achieved by reinforcement finetuning that relies solely on text-based CoT. This result shows that grounded CoT is more effective for multimodal reasoning than text-only CoT. Moreover, Point-RFT exhibits superior generalization across several out-of-domain visual document reasoning benchmarks, including CharXiv, PlotQA, IconQA, and TabMWP, highlighting its potential in complex real-world scenarios.
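The reinforcement finetuning stage described above can be illustrated with a minimal reward sketch. This is an assumption-laden illustration, not the paper's actual implementation: the `<point>...</point>` grounding tag, the `Answer:` response format, and the reward weights are all hypothetical stand-ins for whatever annotation scheme Point-RFT uses, and the numeric check mirrors the relaxed-accuracy convention common on ChartQA (correct within 5% relative tolerance).

```python
import re


def format_reward(response: str) -> float:
    """1.0 if the rationale cites at least one visual anchor.

    The <point>...</point> tag is a hypothetical grounding format
    standing in for the paper's visually grounded CoT annotations.
    """
    return 1.0 if re.search(r"<point>.*?</point>", response, re.DOTALL) else 0.0


def accuracy_reward(response: str, gold: str, tol: float = 0.05) -> float:
    """Relaxed-accuracy check in the style common on ChartQA:
    numeric answers count as correct within 5% relative tolerance,
    string answers must match case-insensitively."""
    m = re.search(r"Answer:\s*(.+)", response)
    if not m:
        return 0.0
    pred = m.group(1).strip()
    try:
        p, g = float(pred), float(gold)
        return 1.0 if abs(p - g) <= tol * max(abs(g), 1e-8) else 0.0
    except ValueError:
        return 1.0 if pred.lower() == gold.strip().lower() else 0.0


def total_reward(response: str, gold: str,
                 w_fmt: float = 0.2, w_acc: float = 0.8) -> float:
    """Weighted sum of grounding-format and answer-accuracy rewards;
    the 0.2/0.8 weighting is an illustrative choice."""
    return w_fmt * format_reward(response) + w_acc * accuracy_reward(response, gold)
```

A policy-gradient trainer would score each sampled rollout with `total_reward`, so responses that both anchor their reasoning to chart regions and reach the correct answer are reinforced over purely textual rationales.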