AI Summary
This work addresses the challenge of reward hacking in multimodal large language models (MLLMs) within reinforcement learning, where unreliable reasoning processes hinder effective policy optimization. To mitigate this issue, the authors propose a context-augmented reinforcement learning framework that enhances the reward model's ability to evaluate reasoning by incorporating complete reference solutions. The approach integrates multi-turn sampling and an error-reporting mechanism to guide the policy toward recovery from failures, complemented by fine-grained process verification. Evaluated across 11 perception and reasoning benchmarks, the method substantially outperforms standard RLVR, enabling Qwen3-VL-8B to achieve performance comparable to that of a 32B-scale model while effectively suppressing reward hacking behaviors.
Abstract
We propose ContextRL, a novel framework that leverages context augmentation to overcome these bottlenecks. Specifically, to enhance Identifiability, we provide the reward model with full reference solutions as context, enabling fine-grained process verification to filter out false positives (samples with the right answer but a low-quality reasoning process). To improve Reachability, we introduce a multi-turn sampling strategy in which the reward model generates mistake reports for failed attempts, guiding the policy to "recover" correct responses from previously all-negative groups. Experimental results on 11 perception and reasoning benchmarks show that ContextRL significantly improves knowledge discovery efficiency. Notably, ContextRL enables the Qwen3-VL-8B model to achieve performance comparable to the 32B model, outperforming standard RLVR baselines by a large margin while effectively mitigating reward hacking. Our in-depth analysis reveals the significant potential of contextual information for improving reward model accuracy and documents the widespread occurrence of reward hacking, offering valuable insights for future RLVR research.
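The two mechanisms described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the names `Judgement`, `sample_group`, `toy_policy`, and `toy_judge`, the binary reward, the group size, and the choice to pass only the first mistake report as a hint are all assumptions made for illustration. The sketch shows (1) a judge that sees the full reference solution and rejects a correct answer with unsound reasoning, and (2) a second sampling turn, guided by a mistake report, when every response in the first group fails.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Judgement:
    """Hypothetical reward-model output for one response."""
    answer_correct: bool
    process_sound: bool
    mistake_report: str = ""

    @property
    def positive(self) -> bool:
        # A right answer with flawed reasoning is treated as a false positive.
        return self.answer_correct and self.process_sound


# (question, optional hint) -> response
Policy = Callable[[str, Optional[str]], str]
# (question, response, full reference solution) -> Judgement
Judge = Callable[[str, str, str], Judgement]


def sample_group(question: str, reference: str, policy: Policy, judge: Judge,
                 group_size: int = 4, max_turns: int = 2
                 ) -> Tuple[List[str], List[float], int]:
    """Sample a group; if it is all-negative, retry with a mistake report."""
    hint: Optional[str] = None
    responses: List[str] = []
    rewards: List[float] = []
    turn = 0
    for turn in range(max_turns):
        responses = [policy(question, hint) for _ in range(group_size)]
        judgements = [judge(question, r, reference) for r in responses]
        rewards = [1.0 if j.positive else 0.0 for j in judgements]
        if any(rewards):
            break  # at least one positive sample, so the group is usable
        # Assumption: feed the first mistake report back as context.
        hint = judgements[0].mistake_report
    return responses, rewards, turn


def toy_policy(question: str, hint: Optional[str]) -> str:
    # Stand-in for a model: guesses wrongly unless it receives a hint.
    return "answer: 5" if hint is None else "2+2=4, answer: 4"


def toy_judge(question: str, response: str, reference: str) -> Judgement:
    # Stand-in for a reward model that compares against the reference.
    final = lambda s: s.split("answer:")[-1].strip()
    steps = lambda s: s.split(", answer:")[0]
    ok_answer = final(response) == final(reference)
    ok_process = steps(response) == steps(reference)
    report = "" if ok_answer and ok_process else "reasoning or answer disagrees with the reference"
    return Judgement(ok_answer, ok_process, report)


responses, rewards, turn = sample_group(
    "What is 2+2?", "2+2=4, answer: 4", toy_policy, toy_judge)
```

In this toy run the first group is all-negative, so the second turn uses the mistake report and recovers positive samples. A bare `"answer: 4"` would be rejected by `toy_judge` because its reasoning does not match the reference, which mirrors the false-positive filtering the abstract describes.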