CARE What Fails: Contrastive Anchored-REflection for Verifiable Multimodal Reasoning

📅 2025-12-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the neglect of failure samples, credit-assignment bias, and gradient stagnation in group-relative reinforcement learning with verifiable rewards (RLVR) for multimodal reasoning, this paper proposes an error-centric post-training framework that explicitly converts failure trajectories into supervisory signals. It introduces two key innovations: (1) an anchored-contrastive loss and (2) a reflection-guided resampling (RGR) mechanism, which together transform failure samples into high-quality positive examples with no test-time overhead. Failure-driven learning is further strengthened via subgroup z-score normalization, negative-sample-specific scaling, and an all-negative-sample rescue. On Qwen2.5-VL-7B, the method achieves a +4.6-point average accuracy gain over GRPO. On Qwen3-VL-8B, it attains state-of-the-art or leading performance on MathVista and MMMU-Pro, while improving both training efficiency and interpretability.

📝 Abstract
Group-relative reinforcement learning with verifiable rewards (RLVR) often wastes the most informative data it already has: the failures. When all rollouts are wrong, gradients stall; when one happens to be correct, the update usually ignores why the others are close-but-wrong, and credit can be misassigned to spurious chains. We present CARE (Contrastive Anchored REflection), a failure-centric post-training framework for multimodal reasoning that turns errors into supervision. CARE combines: (i) an anchored-contrastive objective that forms a compact subgroup around the best rollout and a set of semantically proximate hard negatives, performs within-subgroup z-score normalization with negative-only scaling, and includes an all-negative rescue to prevent zero-signal batches; and (ii) Reflection-Guided Resampling (RGR), a one-shot structured self-repair that rewrites a representative failure and re-scores it with the same verifier, converting near-misses into usable positives without any test-time reflection. CARE improves accuracy and training smoothness while explicitly increasing the share of learning signal that comes from failures. On Qwen2.5-VL-7B, CARE lifts macro-averaged accuracy by 4.6 points over GRPO across six verifiable visual-reasoning benchmarks; with Qwen3-VL-8B it reaches competitive or state-of-the-art results on MathVista and MMMU-Pro under an identical evaluation protocol.
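The advantage shaping described in the abstract can be sketched as follows. This is an illustrative reading of "within-subgroup z-score normalization with negative-only scaling" plus the all-negative rescue, not the paper's implementation; `neg_scale` and `rescue_adv` are hypothetical hyperparameters, and rewards are assumed to be binary verifier scores.

```python
import numpy as np

def anchored_advantages(rewards, neg_scale=0.5, rescue_adv=-1.0):
    """Sketch of CARE-style subgroup advantage shaping (assumed interface).

    rewards: binary verifier scores for one rollout subgroup
             (1.0 = correct, 0.0 = incorrect).
    neg_scale, rescue_adv: illustrative hyperparameters, not from the paper.
    """
    r = np.asarray(rewards, dtype=float)
    if r.max() == 0.0:
        # All-negative rescue: emit a small uniform negative signal instead
        # of the zero gradient that plain group normalization would produce.
        return np.full_like(r, rescue_adv)
    # Within-subgroup z-score normalization.
    adv = (r - r.mean()) / (r.std() + 1e-8)
    # Negative-sample-specific scaling: damp the push away from failures
    # relative to the pull toward the anchor (best rollout).
    adv[r == 0.0] *= neg_scale
    return adv
```

Under plain GRPO-style normalization, an all-zero reward group yields zero advantage everywhere; the rescue branch above is what keeps such batches from being silent.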
Problem

Research questions and friction points this paper is trying to address.

Addresses inefficient learning from failures in verifiable multimodal RL.
Converts errors into supervision via anchored-contrastive and self-repair methods.
Improves accuracy and training smoothness on visual-reasoning benchmarks.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Contrastive anchored objective with subgroup normalization
Reflection-guided resampling converts failures into positives
Failure-centric post-training framework for multimodal reasoning
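The Reflection-Guided Resampling step above can be sketched as a single repair-and-reverify pass. The `policy` and `verifier` callables and the reflection prompt template are illustrative stand-ins for interfaces the paper does not fully specify; the key points it does state are that the repair is one-shot and that the rewrite is scored by the same verifier used in training.

```python
def reflection_guided_resample(prompt, failed_trace, policy, verifier):
    """Hypothetical one-shot RGR sketch (assumed interfaces).

    policy:   callable(str) -> str, generates a repaired solution.
    verifier: callable(prompt, answer) -> float reward, same verifier
              used for the original rollouts.
    """
    reflect_prompt = (
        f"{prompt}\n\nPrevious attempt (incorrect):\n{failed_trace}\n"
        "Identify the error and write a corrected solution."
    )
    repaired = policy(reflect_prompt)    # single structured self-repair
    reward = verifier(prompt, repaired)  # re-score with the same verifier
    # Keep the rewrite only if the verifier now accepts it: a near-miss
    # failure becomes a usable positive. Otherwise discard; no retry loop,
    # so there is no test-time reflection cost.
    return (repaired, reward) if reward > 0 else None
```

Because rejected rewrites are simply dropped, this adds one generation per selected failure during training and nothing at inference time.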