🤖 AI Summary
Small-scale vision-language models (VLMs) lag behind their large-scale counterparts in cross-domain generalization and multi-step reasoning, hindering practical deployment. To address this, we propose Curriculum Reinforcement Fine-Tuning (Curr-ReFT), a two-stage post-training framework tailored to small VLMs: (1) curriculum-based reinforcement learning guided by difficulty-aware rewards, and (2) rejection-sampling-based self-improvement coupled with multimodal self-distillation. Curr-ReFT is the first method to holistically integrate difficulty modeling, rejection sampling, and post-SFT reinforcement alignment, substantially enhancing robustness and cognitive reasoning in small models. Experiments demonstrate that a 3B-parameter Curr-ReFT model achieves state-of-the-art performance on both in-domain and cross-domain visual understanding benchmarks, matching the accuracy of a 32B baseline and effectively narrowing the capability gap between small and large VLMs.
📝 Abstract
While state-of-the-art vision-language models (VLMs) have demonstrated remarkable capabilities on complex visual-text tasks, their success relies heavily on massive model scaling, limiting their practical deployment. Small-scale VLMs offer a more practical alternative but face significant challenges when trained with traditional supervised fine-tuning (SFT), particularly in two aspects: out-of-domain (OOD) generalization and reasoning ability, both of which lag significantly behind contemporary large language models (LLMs). To address these challenges, we propose Curriculum Reinforcement Finetuning (Curr-ReFT), a novel post-training paradigm specifically designed for small-scale VLMs. Inspired by the success of reinforcement learning in LLMs, Curr-ReFT comprises two sequential stages: (1) Curriculum Reinforcement Learning, which ensures steady progression of model capabilities through difficulty-aware reward design, transitioning from basic visual perception to complex reasoning tasks; and (2) Rejection-Sampling-based Self-improvement, which maintains the fundamental capabilities of VLMs through selective learning from high-quality multimodal and language examples. Extensive experiments demonstrate that models trained with the Curr-ReFT paradigm achieve state-of-the-art performance across diverse visual tasks in both in-domain and out-of-domain settings. Moreover, our Curr-ReFT-enhanced 3B model matches the performance of 32B-parameter models, demonstrating that efficient training paradigms can effectively bridge the gap between small and large models.
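To make the two stages concrete, here is a minimal sketch of their core mechanics: a difficulty-aware reward that tightens the matching criterion as the curriculum moves from binary judgment to open-ended answers, and a rejection-sampling filter that keeps only high-scoring candidate responses for self-improvement training. The function names, curriculum levels, partial-credit rule, and threshold are illustrative assumptions for exposition, not the paper's exact implementation.

```python
from typing import List

def reward(pred: str, gold: str, level: str) -> float:
    """Difficulty-aware reward (illustrative): stricter matching at each
    curriculum level, mirroring the perception-to-reasoning progression."""
    if level == "binary":
        # Easiest level: yes/no judgment, exact (case-insensitive) match.
        return 1.0 if pred.strip().lower() == gold.strip().lower() else 0.0
    if level == "choice":
        # Multiple choice: the selected option letter must match.
        return 1.0 if pred.strip().upper() == gold.strip().upper() else 0.0
    # Open-ended answer: partial credit via token overlap with the
    # reference (an assumed stand-in for the paper's reward rule).
    p, g = set(pred.lower().split()), set(gold.lower().split())
    return len(p & g) / max(len(g), 1)

def rejection_sample(candidates: List[str], gold: str, level: str,
                     threshold: float = 0.8) -> List[str]:
    """Rejection-sampling-based selection (illustrative): sample several
    candidate responses, keep only those scoring above a threshold; the
    survivors form the self-improvement fine-tuning set."""
    return [c for c in candidates if reward(c, gold, level) >= threshold]
```

In a full pipeline, `reward` would drive the RL objective in stage 1 (with training data ordered by level), while `rejection_sample` would be applied to model rollouts in stage 2 before a round of supervised fine-tuning on the retained examples.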