🤖 AI Summary
This work addresses two key limitations of vision-language models (VLMs): (1) poor reasoning robustness under imperfect visual perception, and (2) inefficient test-time computational scaling. To this end, we propose a reinforcement learning framework built around a vision-oriented inductive bias. Methodologically: (1) we design a trajectory mixing strategy that jointly trains on rollouts from clean and moderately distorted images; (2) we introduce a progressive noise annealing mechanism that balances exploration diversity against convergence stability at no additional training cost; and (3) the method is plug-and-play, requiring no extra parameter tuning. Evaluated on five out-of-domain benchmarks, our approach, trained on only 2.1K samples, consistently outperforms existing open-source RL-finetuned VLMs while preserving or even improving in-domain performance. The method significantly enhances both the visual robustness and the test-time computational scalability of VLMs.
📝 Abstract
Recent advances in reinforcement learning (RL) have strengthened the reasoning capabilities of vision-language models (VLMs). However, enhancing policy exploration to scale test-time compute more effectively remains underexplored in VLMs. In addition, VLMs continue to struggle with imperfect visual perception, which in turn degrades the subsequent reasoning process. To this end, we propose NoisyRollout, a simple yet effective RL approach that mixes trajectories from both clean and moderately distorted images to introduce targeted diversity in visual perception and the resulting reasoning patterns. Without additional training cost, NoisyRollout enhances the exploration capabilities of VLMs by incorporating a vision-oriented inductive bias. Furthermore, NoisyRollout employs a noise annealing schedule that gradually reduces distortion strength over the course of training, so the model benefits from noisy signals early while retaining training stability and scalability in later stages. With just 2.1K training samples, NoisyRollout achieves state-of-the-art performance among open-source RL-tuned models on 5 out-of-domain benchmarks spanning both reasoning and perception tasks, while preserving comparable or even better in-domain performance.
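The two mechanisms described above, mixing rollouts from clean and distorted images and annealing the distortion strength over training, can be sketched as follows. This is a minimal illustrative sketch only: the function names, the linear decay form, the Gaussian pixel noise, and the clean/noisy split sizes are all assumptions for exposition, not the paper's exact implementation.

```python
import numpy as np

def distortion_strength(step, total_steps, sigma_max=0.5, sigma_min=0.0):
    """Anneal distortion strength from sigma_max down to sigma_min.

    The linear schedule is an assumption; the source only states that
    distortion strength is gradually reduced over training.
    """
    frac = min(step / total_steps, 1.0)
    return sigma_max + (sigma_min - sigma_max) * frac

def mixed_rollout_images(image, step, total_steps, n_clean=4, n_noisy=4,
                         rng=None):
    """Build inputs for a mixed rollout batch: n_clean copies of the clean
    image plus n_noisy copies perturbed with additive Gaussian pixel noise
    (an illustrative stand-in for the paper's image distortions).

    Trajectories sampled from both groups would then be pooled for the
    RL policy update.
    """
    rng = rng or np.random.default_rng(0)
    sigma = distortion_strength(step, total_steps)
    clean = [image.copy() for _ in range(n_clean)]
    noisy = [np.clip(image + rng.normal(0.0, sigma, image.shape), 0.0, 1.0)
             for _ in range(n_noisy)]
    return clean + noisy, sigma

# Early in training the noisy copies are strongly distorted; by the end
# the annealed strength reaches zero and all rollouts are effectively clean.
img = np.full((4, 4), 0.5)
_, sigma_early = mixed_rollout_images(img, step=0, total_steps=100)
_, sigma_late = mixed_rollout_images(img, step=100, total_steps=100)
```

Because only the rollout inputs change, such a scheme adds no trainable parameters and no extra gradient steps, which is consistent with the "no additional training cost" claim in the abstract.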