TACO: Think-Answer Consistency for Optimized Long-Chain Reasoning and Efficient Data Learning via Reinforcement Learning in LVLMs

📅 2025-05-27
🤖 AI Summary
To address three prevalent challenges in long-chain visual reasoning with large vision-language models (LVLMs), namely inconsistency between reasoning paths and final answers, instability in long-horizon reasoning, and low data utilization efficiency, this paper proposes an end-to-end reinforcement learning framework based on Group Relative Policy Optimization (GRPO). Key contributions include: (1) a Think-Answer Consistency constraint that explicitly aligns intermediate reasoning steps with the final answer; (2) a Rollback Resample strategy that adaptively removes unstable samples and reintroduces them later, stabilizing exploration over extended reasoning chains; and (3) test-time resolution-adaptive scaling and adaptive difficulty scheduling to improve generalization and training efficiency. Evaluated on in-distribution and out-of-distribution benchmarks for referring expression comprehension (REC) and visual question answering (VQA), the method achieves significant performance gains, improved reasoning robustness, and stronger cross-distribution generalization.

📝 Abstract
DeepSeek R1 has significantly advanced complex reasoning for large language models (LLMs). While recent methods have attempted to replicate R1's reasoning capabilities in multimodal settings, they face limitations, including inconsistencies between reasoning and final answers, model instability and crashes during long-chain exploration, and low data learning efficiency. To address these challenges, we propose TACO, a novel reinforcement learning algorithm for visual reasoning. Building on Group Relative Policy Optimization (GRPO), TACO introduces Think-Answer Consistency, which tightly couples reasoning with answer consistency to ensure answers are grounded in thoughtful reasoning. We also introduce the Rollback Resample Strategy, which adaptively removes problematic samples and reintroduces them to the sampler, enabling stable long-chain exploration and future learning opportunities. Additionally, TACO employs an adaptive learning schedule that focuses on moderate-difficulty samples to optimize data efficiency. Furthermore, we propose the Test-Time-Resolution-Scaling scheme to address performance degradation due to varying resolutions during reasoning while balancing computational overhead. Extensive experiments on in-distribution and out-of-distribution benchmarks for REC and VQA tasks show that fine-tuning LVLMs with TACO leads to significant performance improvements.
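The abstract does not give the reward formulation, but the Think-Answer Consistency idea can be illustrated with a minimal sketch. The function name, the tag format, the exact reward weights, and the substring-based consistency check below are all assumptions for illustration, not the paper's actual design: on top of a GRPO-style format and accuracy reward, a consistency term pays out only when the final answer is grounded in the reasoning trace.

```python
import re

# Expected response layout: <think>reasoning...</think><answer>final answer</answer>
TAG_RE = re.compile(r"\s*<think>(.*?)</think>\s*<answer>(.*?)</answer>\s*", re.DOTALL)

def taco_style_reward(response: str, gold_answer: str) -> float:
    """Toy reward: format + accuracy + think-answer consistency (illustrative only)."""
    m = TAG_RE.fullmatch(response)
    if m is None:
        return 0.0                       # malformed output earns nothing
    think, answer = m.group(1), m.group(2).strip()
    format_reward = 0.5                  # well-formed <think>/<answer> tags
    accuracy_reward = 1.0 if answer == gold_answer.strip() else 0.0
    # Consistency term: the answer should literally appear in the reasoning
    # trace, so a correct but ungrounded guess scores lower than a grounded one.
    consistency_reward = 0.5 if answer and answer in think else 0.0
    return format_reward + accuracy_reward + consistency_reward
```

Under this sketch, a correct answer that also appears in the reasoning scores higher than the same answer with unrelated reasoning, which is the coupling the abstract describes.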
Problem

Research questions and friction points this paper is trying to address.

Ensures reasoning-answer consistency in multimodal LVLMs
Stabilizes long-chain reasoning to prevent model crashes
Optimizes data learning efficiency via adaptive sampling
Innovation

Methods, ideas, or system contributions that make the work stand out.

Reinforcement learning ensures reasoning-answer consistency
Rollback Resample Strategy stabilizes long-chain exploration
Adaptive learning schedule optimizes data efficiency
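The Rollback Resample Strategy above can be sketched in a few lines. The trigger condition is an assumption: in GRPO, advantages are normalized within each rollout group, so a sample whose rollouts all receive the same reward yields zero advantage and no learning signal. A plausible (hypothetical) implementation drops such samples from the current batch and pushes them back onto the sampler queue for a later epoch:

```python
from collections import deque

def rollback_resample(batch, rollout_rewards, queue: deque):
    """Illustrative rollback-resample step (not the paper's exact algorithm).

    batch          : list of training samples
    rollout_rewards: per-sample list of rewards for that sample's rollout group
    queue          : sampler queue that degenerate samples are rolled back into
    """
    kept = []
    for sample, rewards in zip(batch, rollout_rewards):
        if max(rewards) == min(rewards):   # zero within-group variance -> no GRPO signal
            queue.append(sample)           # revisit later instead of training on it now
        else:
            kept.append(sample)
    return kept
```

Rolled-back samples are not discarded: once the policy improves, their rollout groups may become diverse again, which matches the "future learning opportunities" mentioned in the abstract.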
Authors

Zhehan Kan — PhD student, Tsinghua University
Yanlin Liu — Tsinghua University
Kun Yin — Tencent YouTu Lab
Xinghua Jiang — Tencent YouTu Lab
Xin Li — Tencent YouTu Lab
Haoyu Cao — Tencent YouTu Lab
Yinsong Liu — Tencent YouTu Lab
Deqiang Jiang — Tencent YouTu Lab
Xing Sun — Tencent YouTu Lab
Qingmin Liao — Tsinghua University
Wenming Yang — Tsinghua University