The Synergy Dilemma of Long-CoT SFT and RL: Investigating Post-Training Techniques for Reasoning VLMs

📅 2025-07-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work investigates the combined effect of long chain-of-thought supervised fine-tuning (long-CoT SFT) and reinforcement learning (RL) in post-training large vision-language models (VLMs). We identify a "synergy dilemma": directly combining the two fails to yield additive gains and instead induces trade-offs among accuracy, reasoning-style consistency, and response length. A systematic evaluation of integration strategies, including two-stage, interleaved, and progressive training as well as data mixing and model merging, fails to overcome this bottleneck. At the core is an objective conflict between SFT and RL in VLM reasoning enhancement: SFT strengthens in-depth reasoning on difficult questions but introduces verbosity, whereas RL promotes generalization and conciseness yet gains less on the hardest questions. Based on this insight, we argue for more seamless, adaptive joint-training approaches that align the two objectives and dynamically regulate their contributions.
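For readers unfamiliar with the model-merging baseline evaluated above, the sketch below shows the common recipe of linearly interpolating an SFT checkpoint and an RL checkpoint in weight space. This is a minimal illustration, not the paper's implementation; the checkpoint paths and the merge ratio `alpha` are assumptions.

```python
import torch

def merge_checkpoints(sft_state, rl_state, alpha=0.5):
    """Linear weight-space interpolation: alpha * SFT + (1 - alpha) * RL."""
    merged = {}
    for name, sft_param in sft_state.items():
        merged[name] = alpha * sft_param + (1.0 - alpha) * rl_state[name]
    return merged

# Hypothetical checkpoint paths; the paper does not publish these names.
sft_state = torch.load("vlm_sft.pt", map_location="cpu")
rl_state = torch.load("vlm_rl.pt", map_location="cpu")
torch.save(merge_checkpoints(sft_state, rl_state, alpha=0.5), "vlm_merged.pt")
```

The appeal of merging is that it needs no further training, but as the summary notes, it inherits the same trade-offs as the training-based combinations.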

📝 Abstract
Large vision-language models (VLMs) increasingly adopt post-training techniques such as long chain-of-thought (CoT) supervised fine-tuning (SFT) and reinforcement learning (RL) to elicit sophisticated reasoning. While these methods exhibit synergy in language-only models, their joint effectiveness in VLMs remains uncertain. We present a systematic investigation into the distinct roles and interplay of long-CoT SFT and RL across multiple multimodal reasoning benchmarks. We find that SFT improves performance on difficult questions through in-depth, structured reasoning, but introduces verbosity and degrades performance on simpler ones. In contrast, RL promotes generalization and brevity, yielding consistent improvements across all difficulty levels, though the improvements on the hardest questions are less prominent compared to SFT. Surprisingly, combining them through two-stage, interleaved, or progressive training strategies, as well as data mixing and model merging, fails to produce additive benefits, instead leading to trade-offs in accuracy, reasoning style, and response length. This "synergy dilemma" highlights the need for more seamless and adaptive approaches to unlock the full potential of combined post-training techniques for reasoning VLMs.
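As a rough illustration of the training-based integration strategies the abstract compares, the sketch below contrasts two-stage, interleaved, and progressive schedules. All names (`sft_step`, `rl_step`), round counts, and mixing fractions are placeholders under stated assumptions, not the paper's actual configuration.

```python
import random

def sft_step(model, batch):
    """Placeholder for one long-CoT SFT update (cross-entropy on CoT traces)."""

def rl_step(model, env):
    """Placeholder for one RL update (e.g., a policy-gradient step on a verifiable reward)."""

def two_stage(model, sft_data, rl_env, sft_epochs=2, rl_steps=100):
    # Stage 1: run long-CoT SFT to completion; Stage 2: switch entirely to RL.
    for _ in range(sft_epochs):
        for batch in sft_data:
            sft_step(model, batch)
    for _ in range(rl_steps):
        rl_step(model, rl_env)

def interleaved(model, sft_data, rl_env, rounds=4, rl_steps_per_round=25):
    # Alternate between an SFT pass and a block of RL updates each round.
    for _ in range(rounds):
        for batch in sft_data:
            sft_step(model, batch)
        for _ in range(rl_steps_per_round):
            rl_step(model, rl_env)

def progressive(model, sft_data, rl_env, rounds=4, rl_steps_per_round=25):
    # Shrink the SFT share each round so training gradually shifts toward RL.
    for r in range(rounds):
        keep = max(1, int(len(sft_data) * (1.0 - r / rounds)))
        for batch in random.sample(list(sft_data), keep):
            sft_step(model, batch)
        for _ in range(rl_steps_per_round):
            rl_step(model, rl_env)
```

Per the abstract, none of these schedules escapes the trade-off: each inherits SFT's verbosity, RL's weaker gains on hard questions, or both.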
Problem

Research questions and friction points this paper is trying to address.

Investigates whether long-CoT SFT and RL combine synergistically in VLMs
Examines trade-offs among accuracy, reasoning style, and response length
Highlights the need for more adaptive post-training techniques
Innovation

Methods, ideas, or system contributions that make the work stand out.

Long-CoT SFT enhances structured reasoning on difficult questions but adds verbosity
RL promotes generalization and brevity across difficulty levels
Combined training strategies fail to yield additive benefits (see the data-mixing sketch below)
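The data-mixing strategy referenced in the last item can be pictured as sampling each training example from either a long-CoT pool or a standard short-answer pool. The sketch below assumes a fixed mixing probability `p_cot`; the paper's actual ratios and data sources are not specified here.

```python
import random

def mixed_batches(long_cot_data, standard_data, p_cot=0.5, num_batches=100):
    """Yield training examples, drawn from the long-CoT SFT pool with
    probability p_cot and from the standard pool otherwise."""
    for _ in range(num_batches):
        source = long_cot_data if random.random() < p_cot else standard_data
        yield random.choice(source)

# Hypothetical usage; train_step and the data pools are placeholders.
# for example in mixed_batches(cot_examples, plain_examples, p_cot=0.3):
#     train_step(model, example)
```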