Visual-Advantage On-Policy Distillation for Vision-Language Models

📅 2026-05-20

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses a key limitation of existing on-policy distillation methods for vision-language models: they can improve the student's outputs without strengthening its reliance on visual input at vision-critical tokens. To remedy this, the authors propose a metric termed Visual Advantage (VA), which identifies tokens carrying strong visual supervision signals. Using VA, they introduce a distillation objective that operates at two granularities: at the trajectory level, rollouts are reweighted by their average VA; at the token level, the KL divergence is averaged separately within the high-VA and low-VA token groups. This explicitly separates visually critical tokens from language scaffolding tokens, so that the visual signal is not diluted by the abundant language tokens. Experiments on the Qwen3-VL model family show consistent improvements over standard on-policy distillation across eight benchmarks spanning mathematical reasoning and visual understanding, with gains increasing monotonically with both teacher model scale and training data volume.

📝 Abstract

On-policy knowledge distillation has proven effective for language models, yet its application to vision-language models (VLMs) remains underexplored. We observe that standard on-policy distillation can improve a student's output quality while failing to strengthen its reliance on visual input: on vision-critical tokens, the student's predictions remain largely unchanged whether or not fine-grained visual detail is present, even though the teacher's predictions depend heavily on it. To make this difference observable, we introduce visual advantage (VA), the token-level log-probability difference when the teacher scores a student-generated rollout with versus without access to fine-grained visual detail. VA is concentrated in a small minority of tokens, and these high-VA tokens are the ones that actually carry the visual supervision signal. This motivates a distillation objective that treats them differently from language scaffolding, so their contribution is not diluted by the abundant surrounding language tokens. We propose Visual-Advantage On-Policy Distillation (VA-OPD), which uses VA at two granularities: rollout-level reweighting by trajectory-averaged VA, and token-level KL averaged within high-VA and low-VA groups separately. We train on two math datasets (Geometry3K and ViRL39K) and evaluate on eight benchmarks covering both mathematical reasoning and visual understanding, across three teacher sizes (4B, 8B, and 32B) on the Qwen3-VL family. VA-OPD improves over standard on-policy distillation on every benchmark, with the gain growing monotonically along both the teacher-size and data-scale axes, suggesting that these factors compound consistently.
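
To make the quantities in the abstract concrete, the following is a minimal Python sketch, not the authors' implementation. It computes per-token VA as the difference of teacher log-probabilities with and without fine-grained visual detail, averages a per-token KL within high-VA and low-VA groups separately, and turns trajectory-averaged VA into rollout weights. Several details are assumptions that the abstract does not specify: the fixed `threshold` that splits tokens into groups, the KL direction (student relative to teacher), summing the two group means with equal weight, and the softmax normalization in `rollout_weights`. All function names are hypothetical.

```python
import math


def visual_advantage(logp_with, logp_without):
    """Per-token VA: teacher log-prob with visual detail minus without it."""
    return [a - b for a, b in zip(logp_with, logp_without)]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two discrete distributions over the same vocabulary."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0
    )


def _mean(values):
    """Mean of a list, or 0.0 for an empty group."""
    return sum(values) / len(values) if values else 0.0


def grouped_token_kl(student_probs, teacher_probs, va, threshold):
    """Average per-token KL within the high-VA and low-VA groups separately.

    Assumptions: tokens with VA >= threshold form the high-VA group, the KL
    is taken as KL(student || teacher), and the two group means are summed.
    """
    high, low = [], []
    for s, t, v in zip(student_probs, teacher_probs, va):
        kl = kl_divergence(s, t)
        (high if v >= threshold else low).append(kl)
    return _mean(high) + _mean(low)


def flat_token_kl(student_probs, teacher_probs):
    """Baseline for comparison: one mean over all tokens of the rollout."""
    return _mean([kl_divergence(s, t) for s, t in zip(student_probs, teacher_probs)])


def rollout_weights(mean_va_per_rollout, temperature=1.0):
    """Rollout-level weights from trajectory-averaged VA.

    Assumption: a softmax over the rollouts, rescaled so the weights
    average to 1; the abstract does not give the normalization.
    """
    m = max(mean_va_per_rollout)
    exps = [math.exp((v - m) / temperature) for v in mean_va_per_rollout]
    z = sum(exps)
    n = len(exps)
    return [n * e / z for e in exps]


# Toy rollout with three tokens and a two-word vocabulary.
# Only the first token depends on visual detail.
logp_with = [-0.5, -1.0, -2.0]
logp_without = [-2.5, -1.0, -2.0]
va = visual_advantage(logp_with, logp_without)

student = [[0.5, 0.5], [0.5, 0.5], [0.5, 0.5]]
teacher = [[0.9, 0.1], [0.5, 0.5], [0.5, 0.5]]

grouped = grouped_token_kl(student, teacher, va, threshold=1.0)
flat = flat_token_kl(student, teacher)
weights = rollout_weights([_mean(va), 0.0])
```

In this toy rollout the student disagrees with the teacher only at the single high-VA token. The flat mean divides that disagreement by the total number of tokens, whereas the grouped mean keeps it at full strength within the high-VA group, which is the dilution effect the abstract describes.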

Problem

Research questions and friction points this paper is trying to address.

vision-language models

on-policy distillation

visual dependence

knowledge distillation

visual supervision

Innovation

Methods, ideas, or system contributions that make the work stand out.

Visual-Advantage

On-Policy Distillation

Vision-Language Models