PeRL: Permutation-Enhanced Reinforcement Learning for Interleaved Vision-Language Reasoning

📅 2025-06-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing multimodal reinforcement learning methods are limited to single-image spatial reasoning and struggle to model cross-image positional relationships, hindering their generalization in complex real-world scenarios. To address this, we propose a novel framework for interleaved vision-language reasoning over multiple images. Our approach introduces two key innovations: (1) image-sequence permutation augmentation to enhance structural invariance, and (2) rollout-filtering resampling to jointly optimize the exploration-exploitation trade-off. We further integrate policy gradient optimization with a multi-stage training strategy to robustly capture inter-image dependencies. Evaluated on five multi-image benchmarks, our method achieves state-of-the-art performance, significantly outperforming both R1-class models and interleaved vision-language model baselines. Moreover, it maintains competitive performance on three single-image benchmarks, demonstrating both effectiveness and strong generalization across task settings.

📝 Abstract
Inspired by the impressive reasoning capabilities demonstrated by reinforcement learning approaches such as DeepSeek-R1, recent research has begun exploring reinforcement learning (RL) to enhance vision-language models (VLMs) on multimodal reasoning tasks. However, most existing multimodal RL approaches remain limited to spatial reasoning within single-image contexts and struggle to generalize to more complex, real-world scenarios involving multi-image positional reasoning, where understanding the relationships across images is crucial. To address this challenge, we propose PeRL, a general reinforcement learning approach tailored to interleaved multimodal tasks, together with a multi-stage strategy designed to enhance the exploration-exploitation trade-off, thereby improving learning efficiency and task performance. Specifically, we introduce permutation of image sequences to simulate varied positional relationships, exploring greater spatial and positional diversity. Furthermore, we design a rollout filtering mechanism for resampling that focuses on the trajectories contributing most to learning optimal behaviors, exploiting learned policies effectively. We evaluate our model on 5 widely used multi-image benchmarks and 3 single-image benchmarks. Our experiments confirm that the PeRL-trained model consistently surpasses R1-related and interleaved VLM baselines by a large margin, achieving state-of-the-art performance on multi-image benchmarks while preserving comparable performance on single-image tasks.
Problem

Research questions and friction points this paper is trying to address.

Enhance vision-language models for complex multimodal reasoning tasks
Address multi-image positional reasoning challenges in real-world scenarios
Improve exploration-exploitation trade-off in reinforcement learning for VLMs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Permutation of image sequences for diversity
Multi-stage RL for exploration-exploitation balance
Rollout filtering mechanism for optimal trajectories
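The first and third innovations above can be sketched in a few lines of Python. This is a minimal illustration under assumed conventions (images referenced by placeholder tags like `<image-1>` in the interleaved prompt, a scalar `reward` per rollout); the function names and data layout are hypothetical, not the paper's actual implementation.

```python
import random
import re
from typing import List, Sequence, Tuple


def permute_images(
    images: Sequence[str], question: str, rng=random
) -> Tuple[List[str], str]:
    """Shuffle the image sequence and remap index references in the prompt.

    Assumes images are referenced as <image-1>, <image-2>, ... in the
    question text (an illustrative convention). Since the tags are remapped
    consistently, the ground-truth answer is unchanged, so each permutation
    is a fresh training sample that rewards order-invariant reasoning.
    """
    order = list(range(len(images)))
    rng.shuffle(order)
    permuted = [images[i] for i in order]
    # images[i] now sits at permuted position order.index(i)
    new_pos = {i: order.index(i) for i in range(len(images))}
    question = re.sub(
        r"<image-(\d+)>",
        lambda m: f"<image-{new_pos[int(m.group(1)) - 1] + 1}>",
        question,
    )
    return permuted, question


def filter_rollouts(rollouts: List[dict]) -> List[dict]:
    """Drop a rollout group whose rewards are all identical.

    If every sampled trajectory receives the same reward, the group carries
    no learning signal for a relative policy-gradient update, so it is
    discarded for resampling (a rough reading of the rollout-filtering idea).
    """
    if not rollouts:
        return []
    rewards = [r["reward"] for r in rollouts]
    if max(rewards) == min(rewards):
        return []
    return rollouts
```

For example, `permute_images(["a", "b", "c"], "Is <image-2> left of <image-1>?")` returns a shuffled image list whose remapped tags still point at the same underlying images, so the original answer remains valid.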