VOLD: Reasoning Transfer from LLMs to Vision-Language Models via On-Policy Distillation

📅 2025-10-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
Vision-language models (VLMs) struggle to develop advanced multimodal reasoning capabilities because high-quality image-text reasoning data is scarce. Method: This paper introduces VOLD, a framework that transfers reasoning capabilities across modalities via on-policy distillation. It combines Group Relative Policy Optimization (GRPO), a reinforcement learning algorithm, with supervised fine-tuning (SFT): a large language model (LLM) serves as the teacher and a VLM as the student, with the teacher guiding student-sampled reasoning traces through on-policy knowledge distillation. The work further identifies and validates the critical role of cold-start alignment in this process. Contribution/Results: Evaluated on diverse multimodal reasoning benchmarks, including MMMU-Pro, MathVision, MathVista, and LogicVista, VOLD substantially outperforms existing state-of-the-art methods. Notably, it generalizes well on mathematical and logical reasoning tasks, establishing a paradigm for transferring reasoning to a modality with scarce training data.

📝 Abstract
Training vision-language models (VLMs) for complex reasoning remains challenging, in part due to the scarcity of high-quality image-text reasoning data. Conversely, text-based reasoning resources are abundant and scalable, but it remains an open question how to leverage them for VLM reasoning. To address this problem, we propose VOLD, a framework to transfer reasoning capabilities from text-only teacher models to VLM student models. To this end, VOLD combines reinforcement learning via Group Relative Policy Optimization (GRPO) with on-policy distillation, which allows the student's reasoning traces to be guided by the teacher model, resulting in a significant gain over using GRPO alone. We further show that cold-start alignment is essential for effective transfer during the online training phase: without sufficient distributional alignment between teacher and student, on-policy distillation fails to provide meaningful guidance. We evaluate VOLD across diverse benchmarks including MMMU-Pro, MathVision, MathVista, and LogicVista, showing that VOLD significantly outperforms the baseline model and improves over the state of the art by a clear margin. Our ablation shows the importance of cold-start alignment via SFT for on-policy distillation with a text-only teacher.
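The abstract pairs two ingredients: GRPO, which scores each sampled rollout by a group-relative advantage, and on-policy distillation, which penalizes the student where its token distribution diverges from the teacher's on student-sampled traces. A minimal numerical sketch of those ingredients follows; the function names, the normalization details, and the `beta` weight are illustrative assumptions, not the paper's implementation:

```python
import math

def softmax(logits):
    """Convert a list of logits into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def reverse_kl(student_logits, teacher_logits):
    """KL(student || teacher) at one token position; near zero when the
    student only places probability mass where the teacher also does."""
    ps, pt = softmax(student_logits), softmax(teacher_logits)
    return sum(p * math.log(p / q) for p, q in zip(ps, pt))

def grpo_advantages(rewards):
    """Group-relative advantages: each rollout's reward is normalized
    against the mean and std of its sampling group."""
    mu = sum(rewards) / len(rewards)
    sd = math.sqrt(sum((r - mu) ** 2 for r in rewards) / len(rewards)) or 1.0
    return [(r - mu) / sd for r in rewards]

def distill_grpo_loss(token_logprobs, advantage, token_kls, beta=0.1):
    """Objective for one rollout: an advantage-weighted policy-gradient
    term plus a per-token reverse-KL penalty toward the teacher
    (beta is an illustrative weighting, not a value from the paper)."""
    pg = -advantage * sum(token_logprobs)
    return pg + beta * sum(token_kls)
```

The reverse-KL direction is what makes the distillation "on-policy": the penalty is evaluated on tokens the student itself sampled, which is also why the cold-start SFT alignment matters — if student and teacher distributions barely overlap, the KL term gives little usable signal.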
Problem

Research questions and friction points this paper is trying to address.

Transferring reasoning skills from text-only LLMs to vision-language models
Addressing scarcity of high-quality image-text reasoning training data
Enhancing VLM reasoning through on-policy distillation and alignment
Innovation

Methods, ideas, or system contributions that make the work stand out.

On-policy distillation transfers reasoning from text models
Group Relative Policy Optimization enhances reinforcement learning
Cold-start alignment enables effective teacher-student knowledge transfer