SFT or RL? An Early Investigation into Training R1-Like Reasoning Large Vision-Language Models

📅 2025-04-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work identifies a critical issue in training large vision-language models (LVLMs): supervised fine-tuning (SFT) induces verbose, low-information, and even erroneous “pseudo-reasoning paths” imitated from expert models, which hinder subsequent reinforcement learning (RL) from developing genuine reasoning and lock models into rigid imitation. To study this systematically, the authors construct VLAA-Thinking, a multimodal reasoning dataset built via a six-step pipeline (captioning, reasoning distillation, answer rewriting, and verification), and train with Group Relative Policy Optimization (GRPO) extended by a mixed reward module that integrates both perception and cognition signals. Instantiated on the Qwen2.5VL-3B architecture, the resulting model, VLAA-Thinker, achieves state-of-the-art performance on the Open LMM Reasoning Leaderboard among 4B-scale models, surpassing the previous best by 1.8%.

📝 Abstract
This work revisits the dominant supervised fine-tuning (SFT) then reinforcement learning (RL) paradigm for training Large Vision-Language Models (LVLMs), and reveals a key finding: SFT can significantly undermine subsequent RL by inducing “pseudo reasoning paths” imitated from expert models. While these paths may resemble the native reasoning paths of RL models, they often involve prolonged, hesitant, less informative steps, and incorrect reasoning. To systematically study this effect, we introduce VLAA-Thinking, a new multimodal dataset designed to support reasoning in LVLMs. Constructed via a six-step pipeline involving captioning, reasoning distillation, answer rewriting and verification, VLAA-Thinking comprises high-quality, step-by-step visual reasoning traces for SFT, along with a more challenging RL split from the same data source. Using this dataset, we conduct extensive experiments comparing SFT, RL and their combinations. Results show that while SFT helps models learn reasoning formats, it often locks aligned models into imitative, rigid reasoning modes that impede further learning. In contrast, building on Group Relative Policy Optimization (GRPO) with a novel mixed reward module integrating both perception and cognition signals, our RL approach fosters more genuine, adaptive reasoning behavior. Notably, our model VLAA-Thinker, based on Qwen2.5VL 3B, achieves top-1 performance on the Open LMM Reasoning Leaderboard (https://huggingface.co/spaces/opencompass/Open_LMM_Reasoning_Leaderboard) among 4B-scale LVLMs, surpassing the previous state-of-the-art by 1.8%. We hope our findings provide valuable insights into developing reasoning-capable LVLMs and can inform future research in this area.
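The core mechanism the abstract describes, GRPO with a mixed reward, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the reward components (`perception_score`, `cognition_score`) and their equal weighting are hypothetical stand-ins for the paper's perception and cognition signals; the group-relative normalization is the standard GRPO advantage computation, which scores each sampled response against the group mean and standard deviation instead of a learned value critic.

```python
from statistics import mean, pstdev


def mixed_reward(perception_score: float, cognition_score: float,
                 w_perception: float = 0.5, w_cognition: float = 0.5) -> float:
    """Combine perception and cognition signals into one scalar reward.

    The component scores and equal weights are illustrative assumptions,
    not the paper's exact reward design.
    """
    return w_perception * perception_score + w_cognition * cognition_score


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """GRPO-style advantage: normalize each response's reward against the
    mean and std of its sampling group (no value critic needed)."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Example: four responses sampled for one image-question pair, each scored
# on a (perception, cognition) axis in [0, 1].
scores = [(0.9, 0.8), (0.2, 0.5), (0.7, 0.7), (0.1, 0.1)]
rewards = [mixed_reward(p, c) for p, c in scores]
advantages = group_relative_advantages(rewards)
```

Responses scoring above the group mean on the combined reward receive positive advantages and are reinforced; the advantages of a group sum to (numerically) zero by construction.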
Problem

Research questions and friction points this paper is trying to address.

Investigates SFT's negative impact on RL in LVLMs
Introduces VLAA-Thinking dataset for visual reasoning
Applies GRPO-based RL with a mixed perception-cognition reward for adaptive reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces VLAA-Thinking dataset for LVLM reasoning
Uses GRPO with mixed reward for adaptive reasoning
Achieves top performance with VLAA-Thinker model