Sparrow: Text-Anchored Window Attention with Visual-Semantic Glimpsing for Speculative Decoding in Video LLMs

πŸ“… 2026-02-16
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
This work addresses the significant performance degradation commonly observed in video large language models when applying speculative decoding, primarily caused by attention dilution and negative visual gain. To overcome these limitations, the authors propose a novel paradigm that fully offloads visual computation to the target model through text-anchored windowed attention and a visual semantic snapshot mechanism. The draft model is trained using intermediate-layer visual states as bridges and incorporates a multi-token prediction strategy to mitigate the distribution shift between training and inference. By eliminating redundant raw visual inputs and leveraging the model’s internalized visual semantics, the method achieves an average 2.82Γ— speedup on long videos containing 25k visual tokens, substantially alleviating performance degradation and enabling real-time long-form video understanding.

πŸ“ Abstract
Although speculative decoding is widely used to accelerate inference in Vision-Language Models (VLMs), it suffers severe performance collapse when applied to Video Large Language Models (Vid-LLMs). The draft model typically falls into the trap of attention dilution and negative visual gain due to key-value cache explosion and context-window mismatch. We observe a visual semantic internalization phenomenon in Vid-LLMs: critical visual semantics are implicitly encoded into text hidden states during deep-layer interactions, which renders raw visual inputs structurally redundant during deep inference. To address this, we propose the Sparrow framework, which first applies visually aware, text-anchored window attention via hidden-state reuse to fully offload visual computation to the target model, then leverages intermediate-layer visual state bridging to train the draft model on semantic-rich intermediate states, thereby filtering out low-level visual noise. Additionally, a multi-token prediction strategy is introduced to close the training-inference distribution shift. Experiments show that Sparrow achieves an average speedup of 2.82x even with 25k visual tokens, effectively resolving the performance degradation on long sequences and offering a practical solution for real-time long-video tasks.
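For context, the speedups described above build on the standard speculative-decoding loop: a cheap draft model proposes several tokens, and the expensive target model verifies them in one pass, keeping the longest accepted prefix. The sketch below is a minimal greedy version of that generic loop; `target_next` and `draft_next` are hypothetical stand-ins for the target and draft models, and this is an illustration of the baseline technique, not Sparrow's text-anchored implementation.

```python
def speculative_decode(target_next, draft_next, prompt, k=4, max_new=16):
    """Generic greedy speculative decoding (illustration only).

    target_next / draft_next: functions mapping a token sequence to the
    next token. The draft proposes k tokens cheaply; the target verifies
    them and keeps the longest prefix matching its own greedy choices.
    """
    seq = list(prompt)
    while len(seq) - len(prompt) < max_new:
        # Draft model proposes k tokens autoregressively.
        proposal, ctx = [], list(seq)
        for _ in range(k):
            tok = draft_next(ctx)
            proposal.append(tok)
            ctx.append(tok)

        # Target verifies: accept proposed tokens while they match what
        # the target would have produced greedily at that position.
        ctx = list(seq)
        for tok in proposal:
            if target_next(ctx) != tok:
                break
            ctx.append(tok)
        seq = ctx

        # On mismatch (or full acceptance), the target emits one token,
        # so at least one token is always produced per round.
        seq.append(target_next(seq))
    return seq[len(prompt):]
```

When the draft agrees with the target, each verification round yields k + 1 tokens for roughly the cost of one target pass; when it disagrees, the loop degrades gracefully to one target token per round, which is why acceptance rate drives the observed speedup.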
Problem

Research questions and friction points this paper is trying to address.

speculative decoding · Video Large Language Models · attention dilution · negative visual gain · performance collapse
Innovation

Methods, ideas, or system contributions that make the work stand out.

speculative decoding · video LLMs · text-anchored window attention · visual-semantic internalization · multi-token prediction
πŸ”Ž Similar Papers
No similar papers found.
Libo Zhang
National Key Laboratory of Parallel and Distributed Computing, College of Computer Science and Technology, National University of Defense Technology, Changsha, China.

Zhaoning Zhang
National University of Defense Technology
MLSys · Computer Vision · Distributed Computing

Wangyang Hong
National Key Laboratory of Parallel and Distributed Computing, College of Computer Science and Technology, National University of Defense Technology, Changsha, China.

Peng Qiao
National University of Defense Technology
image processing · computer vision · machine learning · deep learning

Dongsheng Li
Professor, School of Computer Science, National University of Defense Technology
Distributed Computing · Parallel Computing · Cloud Computing · Peer-to-Peer Computing · Big Data