🤖 AI Summary
This work addresses the susceptibility of current vision-language models (VLMs) to visual perception errors and hallucinations during complex reasoning. Conventional reinforcement learning approaches compound the problem: they sample inefficiently and rely on sparse rewards that cannot tell perception failures from reasoning failures. To overcome these limitations, the authors propose MIRL, which they present as the first framework to bring mutual information (MI) into vision-language reinforcement learning. Specifically, MIRL uses the MI between generated captions and visual inputs as a pre-screening signal, applies a trajectory-forking mechanism to direct the sampling budget toward high-potential trajectories, and adopts a decoupled training strategy that gives the visual perception stage its own dedicated reward. Across six benchmarks, MIRL achieves an average accuracy of 70.22%. With only 10 pre-samples followed by top-6 selection, it surpasses the performance of sampling 16 complete trajectories while using 25% fewer complete trajectories, improving both training efficiency and accuracy.
📝 Abstract
Vision-Language Models (VLMs) frequently suffer from visual perception errors and hallucinations that compromise answer accuracy in complex reasoning tasks. Reinforcement Learning with Verifiable Rewards (RLVR) offers a promising solution by optimizing policies using answer correctness signals. Despite their effectiveness, prevailing RLVR methods face two critical limitations. First, much of the sampling budget is wasted on trajectories doomed to fail due to early visual description errors. Second, sparse rewards cannot distinguish whether failures stem from the visual perception stage or the reasoning stage. We introduce MIRL, a decoupled framework that addresses both limitations by leveraging mutual information (MI) between generated descriptions and visual inputs as a cheap pre-screening signal. This enables intelligent budget allocation toward high-potential trajectories via forking, while decoupled training provides independent MI-based rewards for visual perception optimization, resolving reward blindness. Experiments on six vision-language reasoning benchmarks demonstrate that MIRL achieves 70.22% average accuracy. Using only 10 pre-samples with top-6 selection, it surpasses the performance of sampling 16 complete trajectories while generating 25% fewer complete trajectories. Our code is available at: https://anonymous.4open.science/r/mirl-main/.
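The pre-screening and forking steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not state how MI is estimated or how forks are distributed. The sketch assumes a pointwise MI proxy (caption log-likelihood with the image minus log-likelihood without it) and an even split of the budget across the selected captions. The names `PreSample`, `select_top_k`, and `allocate_forks` are hypothetical, as are all log-probability values.

```python
from dataclasses import dataclass


@dataclass
class PreSample:
    caption: str
    logp_with_image: float     # log p(caption | image, prompt); hypothetical value
    logp_without_image: float  # log p(caption | prompt); hypothetical value

    @property
    def mi_score(self) -> float:
        # Pointwise MI proxy. This estimator is an assumption,
        # not one stated in the abstract.
        return self.logp_with_image - self.logp_without_image


def select_top_k(samples, k):
    """Keep the k pre-samples with the highest MI score."""
    return sorted(samples, key=lambda s: s.mi_score, reverse=True)[:k]


def allocate_forks(selected, total_budget):
    """Split the complete-trajectory budget evenly across the selected
    captions (assumed policy); leftovers go to the highest-scoring ones."""
    base, extra = divmod(total_budget, len(selected))
    return [base + (1 if i < extra else 0) for i in range(len(selected))]


# 10 cheap pre-samples (captions only) with made-up log-probabilities.
pre_samples = [
    PreSample(f"caption {i}", -20.0 - i, -30.0 - 0.5 * i) for i in range(10)
]

baseline_trajectories = 16
budget = baseline_trajectories * 3 // 4  # 25% fewer complete trajectories

selected = select_top_k(pre_samples, k=6)
forks = allocate_forks(selected, budget)

for sample, n in zip(selected, forks):
    print(f"{sample.caption}: MI proxy {sample.mi_score:.1f}, {n} forks")
```

Under these assumptions, the six highest-scoring captions each seed two complete reasoning trajectories, for 12 in total against a baseline of 16. In a decoupled training setup, the same MI score could also serve as the reward for the perception stage, separate from the answer-correctness reward.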