🤖 AI Summary
Existing multimodal large language model (MLLM) research overemphasizes the LLM backbone while neglecting how the vision encoder evolves across training paradigms, particularly the shift from supervised fine-tuning (SFT) to reinforcement learning (RL), and how that evolution shapes the model's visual perception. This work is the first to systematically demonstrate that RL yields substantially stronger and more precisely localized visual representations, whereas SFT fails to achieve comparable gains. Building on this finding, we propose Preference-Instructed Vision OpTimization (PIVOT), a lightweight, efficient, preference-driven framework for optimizing the vision encoder. Through comprehensive evaluation spanning ImageNet classification, semantic segmentation, and gradient-based visualization, we show that a PIVOT-trained encoder achieves superior performance on vision-intensive tasks such as VQA, outperforming larger, more heavily trained baselines while incurring less than 1% of conventional pretraining cost. Our approach establishes a new paradigm for building high-performance vision encoders.
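For readers unfamiliar with gradient-based visualization, the sketch below shows the standard input-saliency formulation (the gradient of a scalar target score with respect to the input image) that analyses of this kind typically build on. It is a generic illustration, not the paper's exact procedure; `model` and `target_logit_fn` are hypothetical placeholders.

```python
# Generic input-saliency sketch (not the paper's exact method).
import torch

def saliency_map(model, image, target_logit_fn):
    """Return a (H, W) heatmap of |d score / d pixel|.

    image: (1, C, H, W) tensor; target_logit_fn selects a scalar
    score from the model output (e.g., the predicted class logit).
    """
    image = image.clone().requires_grad_(True)
    score = target_logit_fn(model(image))
    score.backward()
    # Aggregate absolute gradients over the channel dimension.
    return image.grad.abs().max(dim=1).values.squeeze(0)
```

Brighter regions of the resulting heatmap indicate pixels the model's prediction is most sensitive to, which is one common way to probe how precisely a vision encoder localizes the evidence behind an answer.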
📝 Abstract
A dominant assumption in Multimodal Large Language Model (MLLM) research is that an MLLM's performance is largely inherited from its LLM backbone, given the backbone's immense parameter scale and remarkable capabilities. This assumption has left a void in our understanding of the vision encoder, which determines how MLLMs perceive images. The recent shift in MLLM training paradigms, from Supervised Fine-tuning (SFT) to Reinforcement Learning (RL), magnifies this oversight: there is little analysis of how such training reshapes the vision encoder along with the rest of the MLLM. To address this, we first investigate the impact of training strategies on MLLMs and find that RL shows a clear advantage over SFT on strongly vision-related VQA benchmarks. Motivated by this, we conduct a critical yet under-explored analysis of the vision encoder of MLLMs through diverse and in-depth experiments, ranging from ImageNet classification and segmentation to gradient visualization. Our results demonstrate that an MLLM's post-training strategy (i.e., SFT or RL) not only leads to distinct outcomes on downstream tasks, but also fundamentally reshapes the MLLM's underlying visual representations. Specifically, the key finding of our study is that RL produces stronger and more precisely localized visual representations than SFT, strengthening the vision encoder's contribution to the MLLM. We then distill our findings into a simple recipe for building strong vision encoders for MLLMs: Preference-Instructed Vision OpTimization (PIVOT). When integrated into MLLMs, a PIVOT-trained vision encoder outperforms even larger and more heavily trained counterparts, despite requiring less than 1% of the computational cost of standard vision pretraining. This result opens an effective and efficient path for advancing the vision backbones of MLLMs. Project page: https://june-page.github.io/pivot/
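The abstract does not spell out PIVOT's objective, so the following is only a rough, non-authoritative illustration of what preference-instructed optimization of a vision encoder could look like: a minimal sketch assuming a DPO-style pairwise preference loss in which gradients flow into the vision encoder alone while the LLM stays frozen. Every name here (`answer_logp`, `vision_encoder`, the frozen reference model, `beta=0.1`) is an assumption for illustration, not a detail from the paper.

```python
# Hypothetical sketch of preference-driven vision-encoder tuning.
# Assumes a DPO-style pairwise loss; PIVOT's actual recipe may differ.
import torch
import torch.nn.functional as F

def preference_loss(pol_chosen, pol_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO-style loss over summed answer log-probabilities under the
    current model (pol_*) and a frozen reference model (ref_*)."""
    policy_margin = pol_chosen - pol_rejected
    ref_margin = ref_chosen - ref_rejected
    return -F.logsigmoid(beta * (policy_margin - ref_margin)).mean()

def train_step(vision_encoder, mllm, ref_mllm, batch, optimizer, beta=0.1):
    """One illustrative step: only the vision encoder is trainable.

    mllm.answer_logp is a placeholder that scores an answer string
    given visual features; ref_mllm is a frozen copy of the model.
    """
    vis_feats = vision_encoder(batch["image"])               # trainable path
    with torch.no_grad():
        ref_feats = ref_mllm.vision_encoder(batch["image"])  # frozen reference

    pol_c = mllm.answer_logp(vis_feats, batch["chosen"])
    pol_r = mllm.answer_logp(vis_feats, batch["rejected"])
    ref_c = ref_mllm.answer_logp(ref_feats, batch["chosen"])
    ref_r = ref_mllm.answer_logp(ref_feats, batch["rejected"])

    loss = preference_loss(pol_c, pol_r, ref_c, ref_r, beta)
    optimizer.zero_grad()
    loss.backward()   # gradients reach only the vision encoder's parameters
    optimizer.step()
    return loss.item()
```

Under these assumptions, updating only the encoder is one plausible reading of "lightweight" and of the reported sub-1% training cost, since the LLM backbone never receives gradients; consult the paper for the actual objective and training setup.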