Chart-RVR: Reinforcement Learning with Verifiable Rewards for Explainable Chart Reasoning

📅 2025-10-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large Vision-Language Models (LVLMs) suffer from weak out-of-distribution (OOD) generalization and unreliable chain-of-thought (CoT) reasoning in chart understanding, undermining interpretability and robustness. To address these challenges, we propose Chart-RVR, a novel framework that jointly optimizes chart-type recognition, table reconstruction, and reasoning generation. Chart-RVR couples Group Relative Policy Optimization (GRPO) with a triple automatically verifiable reward mechanism, enforcing concurrent constraints on CoT faithfulness, structural consistency, and logical verifiability. Evaluated on a 3B-parameter LVLM across six chart-reasoning benchmarks, Chart-RVR achieves state-of-the-art performance, substantially narrows the in-distribution vs. OOD accuracy gap, and generates more accurate, interpretable, and verifiable reasoning paths. To our knowledge, this is the first work to systematically enhance both the generalization capability and the interpretability of LVLMs for chart understanding tasks.

📝 Abstract
The capabilities of Large Vision-Language Models (LVLMs) have reached state-of-the-art on many visual reasoning tasks, including chart reasoning, yet they still falter on out-of-distribution (OOD) data and degrade further when asked to produce their chain-of-thought (CoT) rationales, limiting explainability. We present Chart-RVR, a general framework that fine-tunes LVLMs to be more robust and explainable for chart reasoning by coupling Group Relative Policy Optimization (GRPO) with automatically verifiable rewards. Our framework comprises three rewards that maximize: (i) correct chart-type classification, (ii) faithful chart table reconstruction, and (iii) process conformity. Applied to 3-billion-parameter LVLMs, Chart-RVR consistently outperforms standard supervised fine-tuning (SFT) on both in-distribution and out-of-distribution datasets, closing the OOD performance gap while improving rationale fidelity. The resulting models, the Chart-RVR-3B series, achieve state-of-the-art results on six chart-reasoning benchmarks spanning in-domain and OOD settings, surpassing all existing models of comparable size. Beyond accuracy, Chart-RVR yields more interpretable CoT rationales, strengthening trust and reliability, and showcasing the power of verifiable rewards with GRPO for training reliable, interpretable chart-reasoning models.
Problem

Research questions and friction points this paper is trying to address.

Improving chart reasoning robustness on out-of-distribution data
Enhancing explainability through faithful chain-of-thought rationales
Developing verifiable reward framework for reliable chart interpretation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Fine-tunes LVLMs with Group Relative Policy Optimization
Uses verifiable rewards for chart-type classification
Improves chart table reconstruction and process conformity
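
The three verifiable rewards and the group-relative optimization step described above can be sketched as follows. This is a minimal illustration under assumptions: the reward functions, their weighting, and the exact verification checks are hypothetical stand-ins, not the paper's actual implementation; only the GRPO group-normalized advantage follows the standard formulation.

```python
# Hypothetical sketch of Chart-RVR's triple verifiable reward plus GRPO
# advantages. Function names, signatures, and weights are illustrative
# assumptions; only the group-relative normalization is standard GRPO.
from statistics import mean, pstdev


def chart_type_reward(predicted_type: str, true_type: str) -> float:
    """Binary reward for correct chart-type classification."""
    return 1.0 if predicted_type == true_type else 0.0


def table_reconstruction_reward(pred_cells: set, true_cells: set) -> float:
    """F1 overlap between reconstructed and ground-truth table cells,
    as one plausible measure of faithful table reconstruction."""
    if not pred_cells or not true_cells:
        return 0.0
    tp = len(pred_cells & true_cells)
    if tp == 0:
        return 0.0
    precision = tp / len(pred_cells)
    recall = tp / len(true_cells)
    return 2 * precision * recall / (precision + recall)


def process_reward(answer_in_rationale: bool, steps_verifiable: bool) -> float:
    """Process-conformity reward: the CoT must contain the final answer
    and pass an automatic step check (details assumed here)."""
    return 1.0 if (answer_in_rationale and steps_verifiable) else 0.0


def total_reward(ct: float, tr: float, pc: float,
                 weights: tuple = (1.0, 1.0, 1.0)) -> float:
    """Weighted sum of the three rewards; equal weights are an assumption."""
    return weights[0] * ct + weights[1] * tr + weights[2] * pc


def grpo_advantages(rewards: list) -> list:
    """Standard GRPO advantage: normalize each sampled rollout's total
    reward against the mean and std of its sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards) or 1.0  # avoid division by zero
    return [(r - mu) / sigma for r in rewards]
```

In GRPO, several responses are sampled per chart-question pair, each scored with the combined reward, and the policy is updated toward responses whose reward exceeds the group mean, which removes the need for a learned value model.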
Sanchit Sinha
University of Virginia
Natural Language Processing, Machine Learning, Computer Vision
Oana Frunza
Morgan Stanley
NLP/ML
Kashif Rasul
Morgan Stanley
Yuriy Nevmyvaka
Morgan Stanley
Aidong Zhang
University of Virginia