Towards Long-window Anchoring in Vision-Language Model Distillation

📅 2025-12-25
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Small vision-language models (VLMs) suffer from limited context windows, hindering long-range vision–language alignment; while large VLMs support extended contexts, their lightweight variants fail to inherit this capability effectively. To address this, we propose LAid, a novel knowledge distillation framework introducing *long-window anchoring distillation*, the first of its kind. LAid integrates two synergistic mechanisms: (i) dynamic distance-weighted attention matching and (ii) learnable RoPE-based response gain modulation. Through spectral analysis and explicit long-range attention modeling, LAid identifies and transfers position-aware low-frequency attention components, demonstrating their previously unrecognized transferability. Experimental results show that distilled small VLMs achieve an effective context window 3.2× larger than baselines, while maintaining or improving performance across major vision–language benchmarks. Crucially, LAid preserves low-frequency attention structures significantly better than conventional distillation methods.
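The spectral claim above can be made concrete with a toy measurement. The sketch below is an illustrative stand-in (the paper's exact spectral criterion is not given here): `low_freq_energy_ratio` is a hypothetical helper that reports the fraction of a 2D attention map's energy concentrated in low spatial frequencies, which one could compare between teacher, baseline student, and distilled student.

```python
import numpy as np

def low_freq_energy_ratio(attn, cutoff=0.25):
    """Fraction of an attention map's spectral energy in low frequencies.

    Illustrative sketch only; the paper's actual spectral analysis may
    use a different band definition or normalization.

    attn: (T, T) attention map.
    cutoff: fraction of the frequency band counted as "low".
    """
    # Power spectrum, shifted so the DC component sits at the center.
    spec = np.fft.fftshift(np.abs(np.fft.fft2(attn)) ** 2)
    T = attn.shape[0]
    c = T // 2
    r = max(1, int(cutoff * T))
    # Energy inside the central (low-frequency) square vs. total energy.
    low = spec[c - r:c + r, c - r:c + r].sum()
    return float(low / spec.sum())
```

A perfectly smooth (constant) map puts essentially all energy at DC, so its ratio is near 1; zero-mean noise spreads energy across the band and scores much lower.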

📝 Abstract
While large vision-language models (VLMs) demonstrate strong long-context understanding, their prevalent small variants fail at vision–language alignment due to limited window sizes. We discover that knowledge distillation, with window sizes anchored from large teacher models, improves students' long-context capability as a complement to Rotary Position Embeddings (RoPE). Building on this insight, we propose LAid, which directly targets the transfer of long-range attention mechanisms through two complementary components: (1) progressive distance-weighted attention matching, which dynamically emphasizes longer position differences during training, and (2) learnable RoPE response gain modulation, which selectively amplifies position sensitivity where needed. Extensive experiments across multiple model families demonstrate that LAid-distilled models achieve effective context windows up to 3.2 times longer than baseline small models, while maintaining or improving performance on standard VL benchmarks. Spectral analysis further suggests that LAid preserves crucial low-frequency attention components that conventional methods fail to transfer. Our work not only provides practical techniques for building more efficient long-context VLMs but also offers theoretical insight into how positional understanding emerges and transfers during distillation.
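The second component, RoPE response gain modulation, can be sketched as a small extension of standard rotary embeddings: a per-frequency gain scales how strongly each rotary frequency responds to position. The code below is a hypothetical illustration, not the paper's implementation; it uses the half-split RoPE convention (some implementations interleave dimensions instead), and `gains` would be a learnable parameter in the actual method.

```python
import numpy as np

def rope_with_gain(x, positions, gains):
    """Rotary position embedding with per-frequency response gains.

    Hypothetical sketch of gain-modulated RoPE: `gains[d]` scales the
    position sensitivity of frequency d (1.0 recovers standard RoPE,
    0.0 disables positional rotation for that frequency entirely).

    x: (T, D) queries or keys, D even, half-split convention.
    positions: (T,) integer positions.
    gains: (D // 2,) per-frequency gain.
    """
    T, D = x.shape
    half = D // 2
    # Standard RoPE inverse frequencies, base 10000.
    inv_freq = 1.0 / (10000.0 ** (np.arange(half) / half))
    # Gains modulate the rotation angle, i.e. the position sensitivity.
    angles = positions[:, None] * inv_freq[None, :] * gains[None, :]
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    return np.concatenate([x1 * cos - x2 * sin,
                           x1 * sin + x2 * cos], axis=-1)
```

Because each pair of dimensions is only rotated, the transform preserves vector norms for any gain setting, so amplifying position sensitivity does not change activation scale.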
Problem

Research questions and friction points this paper is trying to address.

Enhances small VLMs' long-context alignment via distillation
Transfers long-range attention mechanisms to overcome window limits
Improves effective context windows while maintaining benchmark performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Progressive distance-weighted attention matching for long-range transfer
Learnable RoPE response gain modulation for selective amplification
Preserving low-frequency attention components via spectral analysis
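The first innovation, progressive distance-weighted attention matching, can be sketched as a weighted squared-error loss between teacher and student attention maps, where the weight on each entry grows with the position difference |i - j| as training progresses. The function below is an illustrative guess at such a schedule (the exponent schedule `4.0 * progress` is an assumption, not the paper's), intended only to show the shape of the idea.

```python
import numpy as np

def distance_weighted_attention_loss(a_teacher, a_student, progress):
    """Progressive distance-weighted attention matching (hypothetical sketch).

    a_teacher, a_student: (T, T) attention maps over the same window.
    progress: training progress in [0, 1]; as it grows, the weights
    shift emphasis toward longer position differences |i - j|.
    """
    T = a_teacher.shape[0]
    dist = np.abs(np.arange(T)[:, None] - np.arange(T)[None, :])  # |i - j|
    # Weight grows with distance; the exponent ramps up over training
    # so early steps match all positions roughly equally, late steps
    # concentrate on long-range entries.
    w = (1.0 + dist / (T - 1)) ** (4.0 * progress)
    w /= w.sum()
    return float(np.sum(w * (a_teacher - a_student) ** 2))
```

Late in training (`progress` near 1), a mismatch on a long-range entry costs several times more than the same-magnitude mismatch on a near-diagonal entry, which is the intended pressure toward transferring long-range attention.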
Haoyi Zhou
Associate Professor, Beihang University
Machine Learning · Data Mining · Time-series
Shuo Li
SKLCCSE, School of Computer Science and Engineering, Beihang University, Beijing, China
Tianyu Chen
SKLCCSE, School of Computer Science and Engineering, Beihang University, Beijing, China
Qi Song
School of Software, Beihang University, Beijing, China
Chonghan Gao
SKLCCSE, School of Computer Science and Engineering, Beihang University, Beijing, China
Jianxin Li
SKLCCSE, School of Computer Science and Engineering, Beihang University, Beijing, China; Zhongguancun Laboratory, Beijing, China