Why Does RL Generalize Better Than SFT? A Data-Centric Perspective on VLM Post-Training

📅 2026-02-11
🤖 AI Summary
Existing research indicates that reinforcement learning (RL) achieves better out-of-distribution (OOD) generalization than supervised fine-tuning (SFT) in the post-training of vision-language models, yet the underlying mechanism remains unclear. This work investigates the phenomenon from a data-centric perspective and reveals that RL’s advantage stems from its implicit preference for samples of moderate difficulty. Building on this insight, we propose Difficulty-Curated Supervised Fine-Tuning (DC-SFT), which explicitly selects moderately difficult examples for training. DC-SFT consistently outperforms both standard SFT and RL across multiple OOD tasks, while also enhancing training stability and substantially reducing computational overhead.

📝 Abstract
The adaptation of large-scale Vision-Language Models (VLMs) through post-training reveals a pronounced generalization gap: models fine-tuned with Reinforcement Learning (RL) consistently achieve superior out-of-distribution (OOD) performance compared to those trained with Supervised Fine-Tuning (SFT). This paper posits a data-centric explanation for this phenomenon, contending that RL's generalization advantage arises from an implicit data filtering mechanism that inherently prioritizes medium-difficulty training samples. To test this hypothesis, we systematically evaluate the OOD generalization of SFT models across training datasets of varying difficulty levels. Our results confirm that data difficulty is a critical factor, revealing that training on hard samples significantly degrades OOD performance. Motivated by this finding, we introduce Difficulty-Curated SFT (DC-SFT), a straightforward method that explicitly filters the training set based on sample difficulty. Experiments show that DC-SFT not only substantially enhances OOD generalization over standard SFT, but also surpasses the performance of RL-based training, all while providing greater stability and computational efficiency. This work offers a data-centric account of the OOD generalization gap in VLMs and establishes a more efficient pathway to achieving robust generalization. Code is available at https://github.com/byyx666/DC-SFT.
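The curation step the abstract describes can be sketched minimally. This is a hypothetical illustration, not the paper's exact recipe: it assumes difficulty is proxied by the model's failure rate over several rollouts on each sample, and the medium-difficulty band thresholds are placeholder values.

```python
# Hypothetical sketch of difficulty-curated data selection in the spirit of
# DC-SFT. The pass-rate difficulty proxy and the band thresholds (lo, hi)
# are illustrative assumptions, not taken from the paper.

def difficulty(passes: int, rollouts: int) -> float:
    """Difficulty as the model's failure rate: 0.0 = trivial, 1.0 = never solved."""
    return 1.0 - passes / rollouts

def curate(samples, lo=0.25, hi=0.75):
    """Keep samples in a medium-difficulty band, dropping trivial and very hard ones."""
    return [s for s in samples
            if lo <= difficulty(s["passes"], s["rollouts"]) <= hi]

# Toy pool: passes out of 8 rollouts per sample.
pool = [
    {"id": "easy",   "passes": 8, "rollouts": 8},  # difficulty 0.0 -> dropped
    {"id": "medium", "passes": 4, "rollouts": 8},  # difficulty 0.5 -> kept
    {"id": "hard",   "passes": 0, "rollouts": 8},  # difficulty 1.0 -> dropped
]
kept = curate(pool)
```

The filtered set `kept` would then feed a standard SFT pipeline unchanged; the abstract's claim is that removing the hard tail (which degrades OOD performance) is what matters, mirroring the implicit filtering RL performs.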
Problem

Research questions and friction points this paper is trying to address.

generalization gap
Vision-Language Models
Reinforcement Learning
Supervised Fine-Tuning
out-of-distribution
Innovation

Methods, ideas, or system contributions that make the work stand out.

Reinforcement Learning
Supervised Fine-Tuning
Out-of-Distribution Generalization
Data Difficulty
Vision-Language Models
Aojun Lu
College of Computer Science, Sichuan University, Chengdu, China
Tao Feng
Department of Computer Science and Technology, Tsinghua University, Beijing, China
Hangjie Yuan
Alibaba DAMO | ZJU | MMLab@NTU
Generative Models, Multimodal Models, Foundation Models, Video Understanding
Wei Li
Sichuan University
Camera networks
Yanan Sun
Professor, College of Computer Science, Sichuan University
Neural Architecture Search