🤖 AI Summary
Small-scale vision-language models (SVLMs) suffer from unreliable reasoning, weak instruction following, and pseudo (spurious) chain-of-thought traces due to their limited parameter capacity; directly applying standard training paradigms to them also risks advantage collapse. To address these issues, this paper proposes DyME, a novel training paradigm that dynamically selects between memorization-based learning (supervised fine-tuning, SFT) and exploration-based learning (reinforcement learning with verifiable reward, RLVR) at each optimization step, preventing sub-optimal convergence and enabling a synergistic integration of the two paradigms. Empirical results demonstrate substantial improvements in reasoning reliability, answer accuracy, and cross-domain generalization on diverse visual reasoning benchmarks, consistently outperforming SFT- and RL-based baselines. This work establishes a viable pathway toward reliable reasoning in lightweight SVLMs.
📝 Abstract
Empowering Small-scale Vision-Language Models (SVLMs) with reliable thinking capabilities remains fundamentally challenging due to their limited parameter capacity and weak instruction-following abilities. Existing training paradigms, including Supervised Fine-Tuning (SFT) and Reinforcement Learning with Verifiable Reward (RLVR), impose substantial demands on the base VLM that exceed the capacity of SVLMs. Consequently, directly applying these paradigms to SVLMs often suffers from severe pseudo thinking traces and advantage collapse, ultimately undermining both thinking reliability and task performance. A natural solution is to combine SFT and RLVR, leveraging their complementarity to reduce the dependence on model capacity. However, the widely adopted two-stage training paradigm still performs poorly on SVLMs, as their tendency toward sub-optimal convergence hinders the trade-off and limits the benefits of the combination. To address this, we propose DyME, a novel training paradigm that Dynamically selects between Memorization (via SFT) and Exploration (via RLVR) modes at each optimization step, ensuring that every update contributes to the trade-off. Extensive experiments across diverse domains demonstrate that DyME consistently achieves this balance and thus delivers substantial performance improvements. These results establish DyME as a practical and effective solution for empowering SVLMs with reliable thinking capabilities. GitHub: https://github.com/HKUST-LongGroup/DyME
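The abstract's core idea is a per-step choice between a memorization (SFT) update and an exploration (RLVR) update, rather than two fixed training stages. The toy 1-D sketch below illustrates only that control-flow idea: the reward-variance gate, step sizes, and candidate sampling here are hypothetical placeholders for illustration, not the paper's actual selection rule or losses.

```python
import random


def dyme_step_toy(theta, target, rewards, rng):
    """One step of a toy DyME-style loop: pick Memorization (SFT) or
    Exploration (RLVR) dynamically. The gate below (variance of recent
    verifiable rewards) is a hypothetical stand-in for the paper's rule."""

    def reward(x):
        # Verifiable reward: 1 if the "answer" is close enough to the target.
        return 1.0 if abs(x - target) < 0.5 else 0.0

    recent = rewards[-20:]
    if recent:
        mean = sum(recent) / len(recent)
        var = sum((r - mean) ** 2 for r in recent) / len(recent)
    else:
        var = 0.0

    if var < 1e-3:
        # Memorization mode (SFT-like): supervised pull toward the reference.
        # Low reward variance mimics "advantage collapse", where an RL
        # advantage signal carries no information and SFT is preferable.
        theta += 0.1 * (target - theta)
        mode = "SFT"
    else:
        # Exploration mode (RLVR-like): sample candidates and move toward
        # the highest-reward one (a crude policy-improvement step).
        cands = [theta + rng.gauss(0.0, 0.5) for _ in range(4)]
        theta += 0.1 * (max(cands, key=reward) - theta)
        mode = "RLVR"

    rewards.append(reward(theta))
    return theta, mode


def train(steps=200, target=3.0, seed=0):
    """Run the toy loop; returns the final parameter, rewards, and modes."""
    rng = random.Random(seed)
    theta, rewards, modes = 0.0, [], []
    for _ in range(steps):
        theta, mode = dyme_step_toy(theta, target, rewards, rng)
        modes.append(mode)
    return theta, rewards, modes
```

Run end to end, the loop starts in SFT mode (no reward signal yet), switches to exploration once rewards begin to vary, and settles back into SFT once rewards are uniformly high, so both modes are exercised within a single run rather than in separate stages.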