🤖 AI Summary
Multimodal large language models (MLLMs) frequently exhibit inconsistencies between generated reasoning chains and final answers in reinforcement learning settings, undermining logical coherence and task performance. To address this, we propose an answer consistency–enhanced training method built upon the GRPO framework. Our approach integrates chain-of-thought generation, randomized option reconstruction, and a two-stage consistency verification mechanism, coupled with a tailored reward function that explicitly penalizes reasoning–answer mismatches and mitigates reliance on spurious cues such as option ordering. The method significantly improves reasoning coherence and accuracy, yielding average gains of 2.2% on video reasoning and 1.5% on mathematical multimodal reasoning tasks over the GRPO baseline, which reaches only 79.7% reasoning consistency on the MMVU benchmark—setting a scalable, consistency-aware paradigm for multimodal reinforcement learning with verifiable rewards.
📝 Abstract
Recent advances in large language models (LLMs) have demonstrated that reinforcement learning with verifiable rewards (RLVR) can significantly enhance reasoning abilities by directly optimizing correctness, rather than relying solely on supervised imitation. This paradigm has been extended to multimodal LLMs for complex video and image understanding tasks. However, while outcome-driven RL improves answer accuracy, it can inadvertently decouple the reasoning chain from the final answer, producing reasoning traces that are inconsistent with the chosen answer. In our experiments on multiple-choice visual question-answering tasks, the standard GRPO method yields only 79.7% consistency on MMVU between the reasoning steps and the chosen answers, indicating frequent mismatches between answers and reasoning. To address this, we propose Answer-Consistent Reinforcement Learning (ACRE), which modifies the GRPO algorithm with an auxiliary consistency check. After the model generates a chain of thought and an initial answer for a given question, we shuffle the answer options and prompt the model again with the same reasoning trace to predict a second answer. We design a consistency-verification reward that grants a high reward only if both the original and the post-shuffle answers agree and are correct; otherwise, a lower reward is assigned accordingly. This mechanism penalizes reasoning–answer misalignment and discourages the model from relying on spurious patterns, such as option ordering biases. We evaluate ACRE on challenging video reasoning benchmarks and multimodal math reasoning benchmarks, achieving average improvements of 2.2% on video reasoning and 1.5% on math reasoning over the GRPO baseline.
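The two-stage consistency check described above can be sketched as a reward function. This is a minimal illustration, not the authors' implementation: the `model.generate` interface (returning a reasoning trace and a chosen option) and the specific reward values are assumptions for exposition.

```python
import random


def consistency_reward(model, question, options, correct_option,
                       full_reward=1.0, partial_reward=0.2):
    """Sketch of ACRE's consistency-verification reward.

    `model.generate` is a hypothetical interface that returns a
    (reasoning_trace, chosen_option) pair; the reward magnitudes
    are illustrative placeholders, not values from the paper.
    """
    # Stage 1: generate a chain of thought and an initial answer.
    reasoning, first_answer = model.generate(question, options)

    # Stage 2: shuffle the answer options and re-prompt the model
    # with the SAME reasoning trace to predict a second answer.
    shuffled = random.sample(options, k=len(options))
    _, second_answer = model.generate(question, shuffled,
                                      reasoning=reasoning)

    # High reward only if both answers agree AND are correct;
    # any mismatch or wrong answer earns the lower reward,
    # penalizing reasoning–answer misalignment and ordering biases.
    if first_answer == second_answer == correct_option:
        return full_reward
    return partial_reward
```

Because the second pass sees a different option ordering, a model that picks answers from positional cues rather than its own reasoning will disagree with itself and forfeit the full reward.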