🤖 AI Summary
Small-scale vision-language models (VLMs) lag behind their large-scale counterparts in cross-domain generalization and multi-step reasoning, hindering practical deployment. To address this, we propose Curriculum Reinforcement Fine-Tuning (Curr-ReFT), a two-stage post-training framework tailored to small VLMs: (1) curriculum-based reinforcement learning guided by difficulty-aware rewards, and (2) rejection-sampling-based self-improvement coupled with multimodal self-distillation. Curr-ReFT is the first method to holistically integrate difficulty modeling, rejection sampling, and post-SFT reinforcement alignment, substantially enhancing robustness and cognitive reasoning in small models. Experiments demonstrate that a 3B-parameter Curr-ReFT model achieves state-of-the-art performance on both in-domain and cross-domain visual understanding benchmarks, matching the accuracy of a 32B baseline and effectively narrowing the capability gap between small and large VLMs.
📝 Abstract
While state-of-the-art vision-language models (VLMs) have demonstrated remarkable capabilities on complex visual-text tasks, their success relies heavily on massive model scaling, limiting their practical deployment. Small-scale VLMs offer a more practical alternative but face significant challenges when trained with traditional supervised fine-tuning (SFT), particularly in two aspects: out-of-domain (OOD) generalization and reasoning ability, both of which lag significantly behind contemporary large language models (LLMs). To address these challenges, we propose Curriculum Reinforcement Finetuning (Curr-ReFT), a novel post-training paradigm specifically designed for small-scale VLMs. Inspired by the success of reinforcement learning in LLMs, Curr-ReFT comprises two sequential stages: (1) Curriculum Reinforcement Learning, which ensures steady progression of model capabilities through difficulty-aware reward design, transitioning from basic visual perception to complex reasoning tasks; and (2) Rejection-Sampling-based Self-improvement, which maintains the fundamental capabilities of VLMs through selective learning from high-quality multimodal and language examples. Extensive experiments demonstrate that models trained with the Curr-ReFT paradigm achieve state-of-the-art performance across diverse visual tasks in both in-domain and out-of-domain settings. Moreover, our Curr-ReFT-enhanced 3B model matches the performance of 32B-parameter models, demonstrating that efficient training paradigms can effectively bridge the gap between small and large models.
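To make the two stages concrete, here is a minimal sketch of their core mechanics: a difficulty-aware reward that tightens the matching criterion as the curriculum moves from binary judgment to open-ended answers, and a rejection-sampling filter that keeps only high-scoring candidate responses for self-improvement training. The function names, curriculum levels, partial-credit rule, and threshold are illustrative assumptions for exposition, not the paper's exact implementation.

```python
from typing import List

def reward(pred: str, gold: str, level: str) -> float:
    """Difficulty-aware reward (illustrative): stricter matching at each
    curriculum level, mirroring the perception-to-reasoning progression."""
    if level == "binary":
        # Easiest level: yes/no judgment, exact (case-insensitive) match.
        return 1.0 if pred.strip().lower() == gold.strip().lower() else 0.0
    if level == "choice":
        # Multiple choice: the selected option letter must match.
        return 1.0 if pred.strip().upper() == gold.strip().upper() else 0.0
    # Open-ended answer: partial credit via token overlap with the
    # reference (an assumed stand-in for the paper's reward rule).
    p, g = set(pred.lower().split()), set(gold.lower().split())
    return len(p & g) / max(len(g), 1)

def rejection_sample(candidates: List[str], gold: str, level: str,
                     threshold: float = 0.8) -> List[str]:
    """Rejection-sampling-based selection (illustrative): sample several
    candidate responses, keep only those scoring above a threshold; the
    survivors form the self-improvement fine-tuning set."""
    return [c for c in candidates if reward(c, gold, level) >= threshold]
```

In a full pipeline, `reward` would drive the RL objective in stage 1 (with training data ordered by level), while `rejection_sample` would be applied to model rollouts in stage 2 before a round of supervised fine-tuning on the retained examples.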