Masking Teacher and Reinforcing Student for Distilling Vision-Language Models

📅 2025-12-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address training instability and performance degradation in vision-language model (VLM) knowledge distillation caused by the teacher–student capacity gap, this paper proposes Masters, a masked progressive reinforcement learning distillation framework. Methodologically, Masters introduces (1) a masked teacher progressive recovery mechanism—employing iterative weight pruning followed by controlled capacity expansion—to mitigate representation mismatch between teacher and student; and (2) a dual-reward offline reinforcement learning strategy that jointly optimizes response accuracy and distillability, eliminating the need for costly online chain-of-thought generation. Evaluated on multimodal understanding benchmarks, Masters consistently enhances small-model performance, improves training stability, and increases response precision. The framework establishes a novel paradigm for efficient, scalable VLM distillation, advancing the state of the art in compact multimodal model learning.

📝 Abstract
Large-scale vision-language models (VLMs) have recently achieved remarkable multimodal understanding, but their massive size makes them impractical for deployment on mobile or edge devices. This raises the need for compact yet capable VLMs that can efficiently learn from powerful large teachers. However, distilling knowledge from a large teacher to a small student remains challenging due to their large size gap: the student often fails to reproduce the teacher's complex, high-dimensional representations, leading to unstable learning and degraded performance. To address this, we propose Masters (Masking Teacher and Reinforcing Student), a mask-progressive reinforcement learning (RL) distillation framework. Masters first masks non-dominant weights of the teacher to reduce unnecessary complexity, then progressively restores the teacher by gradually increasing its capacity during training. This strategy allows the student to learn richer representations from the teacher in a smooth and stable manner. To further refine knowledge transfer, Masters integrates an offline RL stage with two complementary rewards: an accuracy reward that measures the correctness of the generated responses, and a distillation reward that quantifies the ease of transferring responses from teacher to student. Unlike online think-answer RL paradigms that are computationally expensive and generate lengthy responses, our offline RL leverages pre-generated responses from masked teachers. These provide rich yet efficient guidance, enabling students to achieve strong performance without requiring the think-answer process.
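The masking-and-restoration idea described above can be sketched as magnitude pruning with a growing keep-ratio schedule. This is a minimal illustration, not the paper's implementation: the exact "non-dominant weight" criterion, the starting ratio, and the linear schedule are all assumptions.

```python
import numpy as np

def mask_teacher_weights(weights, keep_ratio):
    """Keep only the top keep_ratio fraction of weights by magnitude,
    zeroing the rest (a magnitude-pruning proxy for masking the
    teacher's non-dominant weights)."""
    flat = np.abs(weights).ravel()
    k = max(1, int(np.ceil(keep_ratio * flat.size)))
    # k-th largest magnitude acts as the pruning threshold
    threshold = np.partition(flat, -k)[-k]
    return np.where(np.abs(weights) >= threshold, weights, 0.0)

def progressive_schedule(step, total_steps, start_ratio=0.3):
    """Linearly grow the kept fraction from start_ratio to 1.0 over
    training, progressively restoring the teacher's full capacity
    (the linear ramp and start_ratio=0.3 are assumed values)."""
    frac = step / max(1, total_steps)
    return start_ratio + (1.0 - start_ratio) * frac
```

At each distillation step the student would receive targets from `mask_teacher_weights(teacher_weights, progressive_schedule(step, total_steps))`, so early targets come from a simplified teacher and later ones from the full model.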
Problem

Research questions and friction points this paper is trying to address.

Distill large vision-language models into compact ones for mobile and edge deployment
Address unstable learning caused by the teacher-student size gap in model distillation
Enhance knowledge transfer with efficient offline reinforcement learning rewards
Innovation

Methods, ideas, or system contributions that make the work stand out.

Masking teacher weights to reduce complexity
Progressive restoration of teacher capacity
Offline RL with accuracy and distillation rewards
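The dual-reward design in the last point can be sketched as a weighted sum of two terms. This is a hedged toy proxy: exact-match accuracy, a student-likelihood transferability score, and the `alpha` weighting are all assumptions, not the paper's actual reward functions.

```python
import math

def accuracy_reward(response, reference):
    """1.0 if the generated answer matches the reference, else 0.0
    (exact match used here as a simple correctness proxy)."""
    return 1.0 if response.strip().lower() == reference.strip().lower() else 0.0

def distillation_reward(student_token_logprobs):
    """Higher when the student assigns high likelihood to the teacher's
    pre-generated response, i.e. the response is easy to transfer
    (mean per-token probability, a hypothetical distillability proxy)."""
    avg = sum(student_token_logprobs) / len(student_token_logprobs)
    return math.exp(avg)  # in (0, 1]

def total_reward(response, reference, student_token_logprobs, alpha=0.5):
    """Weighted combination of correctness and distillability
    (alpha is an assumed trade-off hyperparameter)."""
    return (alpha * accuracy_reward(response, reference)
            + (1 - alpha) * distillation_reward(student_token_logprobs))
```

Because the responses are pre-generated by masked teachers, this reward can be computed offline, avoiding the repeated rollout cost of online think-answer RL.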