🤖 AI Summary
This work addresses two key limitations of vision-language models (VLMs): (1) poor reasoning robustness under imperfect visual perception, and (2) inefficient test-time computational scaling. To this end, we propose a reinforcement learning framework built around a vision-oriented inductive bias. Methodologically: (1) we design a trajectory mixing strategy that jointly trains on rollouts from clean and moderately distorted images; (2) we introduce a progressive noise annealing mechanism that balances exploration diversity against convergence stability at no additional training cost; and (3) the method is plug-and-play, requiring no extra parameter tuning. Evaluated on five out-of-domain benchmarks, our approach, trained on only 2.1K samples, consistently outperforms existing open-source RL-finetuned VLMs while preserving or even improving in-domain performance. The method significantly enhances both the visual robustness and the test-time computational scalability of VLMs.
📝 Abstract
Recent advances in reinforcement learning (RL) have strengthened the reasoning capabilities of vision-language models (VLMs). However, enhancing policy exploration to scale test-time compute more effectively remains underexplored in VLMs. In addition, VLMs continue to struggle with imperfect visual perception, which in turn degrades the subsequent reasoning process. To this end, we propose NoisyRollout, a simple yet effective RL approach that mixes trajectories from both clean and moderately distorted images to introduce targeted diversity in visual perception and the resulting reasoning patterns. Without additional training cost, NoisyRollout enhances the exploration capabilities of VLMs by incorporating a vision-oriented inductive bias. Furthermore, NoisyRollout employs a noise annealing schedule that gradually reduces distortion strength over the course of training, so the model benefits from noisy signals early while retaining training stability and scalability in later stages. With just 2.1K training samples, NoisyRollout achieves state-of-the-art performance among open-source RL-tuned models on 5 out-of-domain benchmarks spanning both reasoning and perception tasks, while preserving comparable or even better in-domain performance.
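The two mechanisms described above, mixing rollouts from clean and distorted images and annealing the distortion strength over training, can be sketched as follows. This is a minimal illustrative sketch only: the function names, the linear decay form, the Gaussian pixel noise, and the clean/noisy split sizes are all assumptions for exposition, not the paper's exact implementation.

```python
import numpy as np

def distortion_strength(step, total_steps, sigma_max=0.5, sigma_min=0.0):
    """Anneal distortion strength from sigma_max down to sigma_min.

    The linear schedule is an assumption; the source only states that
    distortion strength is gradually reduced over training.
    """
    frac = min(step / total_steps, 1.0)
    return sigma_max + (sigma_min - sigma_max) * frac

def mixed_rollout_images(image, step, total_steps, n_clean=4, n_noisy=4,
                         rng=None):
    """Build inputs for a mixed rollout batch: n_clean copies of the clean
    image plus n_noisy copies perturbed with additive Gaussian pixel noise
    (an illustrative stand-in for the paper's image distortions).

    Trajectories sampled from both groups would then be pooled for the
    RL policy update.
    """
    rng = rng or np.random.default_rng(0)
    sigma = distortion_strength(step, total_steps)
    clean = [image.copy() for _ in range(n_clean)]
    noisy = [np.clip(image + rng.normal(0.0, sigma, image.shape), 0.0, 1.0)
             for _ in range(n_noisy)]
    return clean + noisy, sigma

# Early in training the noisy copies are strongly distorted; by the end
# the annealed strength reaches zero and all rollouts are effectively clean.
img = np.full((4, 4), 0.5)
_, sigma_early = mixed_rollout_images(img, step=0, total_steps=100)
_, sigma_late = mixed_rollout_images(img, step=100, total_steps=100)
```

Because only the rollout inputs change, such a scheme adds no trainable parameters and no extra gradient steps, which is consistent with the "no additional training cost" claim in the abstract.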