🤖 AI Summary
This work addresses the high computational cost of chain-of-thought reasoning in existing vision-language models (VLMs) and the difficulty student models face in leveraging visual evidence during knowledge distillation. The authors propose a think-answer distillation framework that applies token-wise masking to salient reasoning prefixes, pushing the student model to rely on visual inputs rather than textual cues. The approach replaces the standard causal attention mask with a salient reasoning-prefix mask and pairs it with a self-paced masking budget schedule, guided by the distributional discrepancy between teacher and student, to strengthen the student's grounding in visual content. Experiments show that the method outperforms recent open-source VLMs as well as VLM distillation and self-distillation methods on multimodal reasoning benchmarks, and further analyses indicate improved use of visual information by the student.
📝 Abstract
Recent think-answer approaches in VLMs, such as Qwen3-VL-Thinking, boost reasoning performance by leveraging intermediate thinking steps before the final answer, but their high computational cost limits real-world deployment. To distill such capabilities into compact think-answer VLMs, a primary objective is to improve the student's ability to utilize visual evidence throughout its reasoning trace. To this end, we introduce a novel think-answer distillation framework that encourages the student to anchor its thinking on visual information by masking the student's salient reasoning prefixes. To compensate for the masked textual cues, the student is encouraged to rely more on visual evidence as an alternative source of information during distillation. Our masking strategies include: 1) token-wise salient reasoning-prefix masking, which selectively masks high-influence reasoning prefixes for each next-token prediction, and 2) self-paced masking budget scheduling, which gradually increases the masking scale according to distillation difficulty, measured by the discrepancy between the teacher and student distributions. In the distillation phase, the student is guided by our salient reasoning-prefix mask, which blocks both future tokens and salient reasoning cues, in place of the standard causal mask used for auto-regressive language modeling. Experimental results show that our approach outperforms recent open-source VLMs, VLM distillation, and self-distillation methods on multimodal reasoning benchmarks, while further analyses confirm enhanced visual utilization along the student's thinking process.
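The following is a minimal sketch of how a salient reasoning-prefix mask could be built on top of a causal mask. It is not the authors' implementation. The abstract does not say how saliency is computed or how the budget is scheduled, so the function name `salient_prefix_mask`, the `saliency` score matrix, the `is_reasoning` flags, and the fixed per-query `budget` are all hypothetical stand-ins chosen for illustration.

```python
import numpy as np


def salient_prefix_mask(saliency, is_reasoning, budget):
    """Return a boolean (T, T) mask where True means query i may attend to key j.

    Assumptions (not from the paper):
      saliency[i, j]  - influence score of prefix token j on the prediction at i
      is_reasoning[j] - True for reasoning tokens; other tokens (e.g. visual
                        or prompt tokens) are never masked
      budget          - maximum number of reasoning-prefix tokens hidden per query
    """
    saliency = np.asarray(saliency, dtype=float)
    is_reasoning = np.asarray(is_reasoning, dtype=bool)
    num_tokens = saliency.shape[0]

    # Start from the standard causal mask: no attention to future tokens.
    allowed = np.tril(np.ones((num_tokens, num_tokens), dtype=bool))

    for i in range(num_tokens):
        # Candidates are strictly earlier reasoning tokens, so each query
        # always keeps access to its own position and to non-reasoning tokens.
        candidates = np.flatnonzero(is_reasoning[:i])
        if budget <= 0 or candidates.size == 0:
            continue
        k = min(budget, candidates.size)
        # Pick the k most salient candidates for this particular query.
        order = np.argsort(-saliency[i, candidates], kind="stable")
        allowed[i, candidates[order[:k]]] = False

    return allowed
```

With `budget=0` the function returns the ordinary causal mask. Raising `budget` over the course of training would be one way to mimic a masking scale that grows with the schedule, though the rule that maps teacher-student discrepancy to a budget is left open here because the abstract does not specify it.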