GRPO-CARE: Consistency-Aware Reinforcement Learning for Multimodal Reasoning

📅 2025-06-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing multimodal large language models (MLLMs) lack rigorous evaluation of reinforcement learning (RL) post-training for video understanding, and standard GRPO optimizes only answer correctness, ignoring logical consistency between reasoning steps and final answers, which yields a consistency rate of just 57.9%. Method: a consistency-aware RL framework with a two-tier reward mechanism: a base reward ensuring answer correctness, and an adaptive consistency bonus, computed via reference-model guidance and peer-group comparison, that explicitly models logical coherence across reasoning chains; rigid KL divergence penalties are further replaced by this unsupervised consistency optimization. Contribution/Results: On the hardest evaluation level of SEED-Bench-R1, the method achieves a 6.7% absolute accuracy gain and raises reasoning consistency to 82.4% (+24.5 percentage points), significantly improving cross-task generalization and transferability.

📝 Abstract
Recent reinforcement learning approaches, such as outcome-supervised GRPO, have advanced Chain-of-Thought reasoning in large language models (LLMs), yet their adaptation to multimodal LLMs (MLLMs) is unexplored. To address the lack of rigorous evaluation for MLLM post-training methods, we introduce SEED-Bench-R1, a benchmark with complex real-world videos requiring balanced perception and reasoning. It offers a large training set and evaluates generalization across three escalating challenges: in-distribution, cross-environment, and cross-environment-task scenarios. Using SEED-Bench-R1, we find that standard GRPO, while improving answer accuracy, often reduces logical coherence between reasoning steps and answers, with only a 57.9% consistency rate. This stems from reward signals focusing solely on final answers, encouraging shortcuts, and strict KL penalties limiting exploration. To address this, we propose GRPO-CARE, a consistency-aware RL framework optimizing both answer correctness and reasoning coherence without explicit supervision. GRPO-CARE introduces a two-tiered reward: (1) a base reward for answer correctness, and (2) an adaptive consistency bonus, computed by comparing the model's reasoning-to-answer likelihood (via a slowly-evolving reference model) against group peers. This dual mechanism amplifies rewards for reasoning paths that are both correct and logically consistent. Replacing KL penalties with this adaptive bonus, GRPO-CARE outperforms standard GRPO on SEED-Bench-R1, achieving a 6.7% performance gain on the hardest evaluation level and a 24.5% improvement in consistency. It also shows strong transferability, improving model performance across diverse video understanding benchmarks. Our work contributes a systematically designed benchmark and a generalizable post-training framework, advancing the development of more interpretable and robust MLLMs.
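The two-tiered reward in the abstract can be sketched as follows. This is a minimal illustration, not the paper's exact formulation: the `ref_likelihood` interface, the fixed `bonus_weight`, and the group-mean thresholding rule are all assumptions made for clarity; the paper computes the consistency bonus adaptively from peer-group comparison under a slowly-evolving reference model.

```python
def grpo_care_rewards(samples, ref_likelihood, bonus_weight=0.5):
    """Sketch of a two-tiered GRPO-CARE-style reward for one rollout group.

    samples: list of dicts with keys 'reasoning', 'answer', 'correct' (bool).
    ref_likelihood: callable(reasoning, answer) -> P(answer | reasoning)
        under a slowly-evolving reference model (assumed interface).
    bonus_weight: illustrative size of the consistency bonus.
    """
    # Tier 1: base reward for answer correctness.
    base = [1.0 if s["correct"] else 0.0 for s in samples]

    # Reference-model likelihood of each final answer given its reasoning.
    scores = [ref_likelihood(s["reasoning"], s["answer"]) for s in samples]
    mean_score = sum(scores) / len(scores)

    # Tier 2: adaptive consistency bonus, granted only to correct answers
    # whose reasoning is more answer-consistent than the group average.
    rewards = []
    for b, c in zip(base, scores):
        bonus = bonus_weight if (b > 0 and c > mean_score) else 0.0
        rewards.append(b + bonus)
    return rewards
```

Replacing the KL penalty with this peer-relative bonus means exploration is constrained only indirectly: reasoning paths are favored when the reference model finds them more predictive of their own answers than their group peers.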
Problem

Research questions and friction points this paper is trying to address.

Evaluating multimodal LLM post-training methods rigorously
Improving answer accuracy and logical coherence in reasoning
Enhancing consistency in reinforcement learning for MLLMs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Consistency-aware RL framework for MLLMs
Two-tiered reward system for correctness and coherence
Adaptive consistency bonus replaces KL penalties