Rethinking RL Scaling for Vision Language Models: A Transparent, From-Scratch Framework and Comprehensive Evaluation Scheme

📅 2025-04-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing VLM reinforcement learning (RL) research suffers from high engineering complexity, poor reproducibility, and a lack of standardized evaluation protocols. To address these challenges, we propose an end-to-end, fully reproducible RL framework tailored for vision-language models (VLMs), built upon policy gradient methods and supporting multiple architectures—including LLaVA and Qwen-VL—via a minimal four-step training pipeline. We introduce a standardized evaluation paradigm that jointly quantifies response length and reflection quality, enabling systematic analysis of training dynamics. Empirical validation across multiple visual reasoning benchmarks demonstrates that RL consistently outperforms supervised fine-tuning in generalization, even with high-quality training data; reflection capability exhibits a strong positive correlation with output length; and response length is highly sensitive to random seed initialization. This work establishes a transparent, reproducible, and comparable infrastructure—alongside empirically grounded baselines—for advancing VLM-RL research.
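The paper's pipeline is built on policy gradient methods, though the summary above does not spell out the update rule. As a minimal illustration of the underlying idea (not the paper's actual implementation), the sketch below runs REINFORCE on a toy two-action bandit with a verifier-style binary reward; `reinforce_step` and the reward function are illustrative names, not from the paper.

```python
import math
import random

def softmax(logits):
    # Numerically stable softmax over a list of logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def reinforce_step(logits, reward_fn, lr=0.5, rng=random):
    """One REINFORCE update: sample an action from the softmax policy,
    score it, and move logits in the direction reward * grad log pi(a).
    For a softmax policy, grad log pi(a) w.r.t. the logits is
    one_hot(a) - probs."""
    probs = softmax(logits)
    action = rng.choices(range(len(logits)), weights=probs)[0]
    reward = reward_fn(action)
    new_logits = [
        l + lr * reward * ((1.0 if i == action else 0.0) - p)
        for i, (l, p) in enumerate(zip(logits, probs))
    ]
    return new_logits, reward

rng = random.Random(0)
logits = [0.0, 0.0]
# Toy verifier: action 0 is the "correct" answer and earns reward 1.
reward = lambda a: 1.0 if a == 0 else 0.0
for _ in range(200):
    logits, _ = reinforce_step(logits, reward, rng=rng)
probs = softmax(logits)
```

After a few hundred updates the policy concentrates its probability mass on the rewarded action; RLHF-style pipelines for VLMs apply the same gradient estimator, with the language model's token distribution as the policy and a learned or rule-based reward in place of the toy verifier.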

📝 Abstract
Reinforcement learning (RL) has recently shown strong potential in improving the reasoning capabilities of large language models and is now being actively extended to vision-language models (VLMs). However, existing RL applications in VLMs often rely on heavily engineered frameworks that hinder reproducibility and accessibility, while lacking standardized evaluation protocols, making it difficult to compare results or interpret training dynamics. This work introduces a transparent, from-scratch framework for RL in VLMs, offering a minimal yet functional four-step pipeline validated across multiple models and datasets. In addition, a standardized evaluation scheme is proposed to assess training dynamics and reflective behaviors. Extensive experiments on visual reasoning tasks uncover key empirical findings: response length is sensitive to random seeds, reflection correlates with output length, and RL consistently outperforms supervised fine-tuning (SFT) in generalization, even with high-quality data. These findings, together with the proposed framework, aim to establish a reproducible baseline and support broader engagement in RL-based VLM research.
Problem

Research questions and friction points this paper is trying to address.

Lack of transparent RL frameworks for vision-language models
Absence of standardized evaluation protocols for RL in VLMs
Difficulty comparing results and interpreting training dynamics across studies
Innovation

Methods, ideas, or system contributions that make the work stand out.

Transparent from-scratch RL framework for VLMs
Standardized evaluation scheme for training dynamics
Empirical evidence that RL consistently outperforms SFT in generalization, even with high-quality data