🤖 AI Summary
Video understanding is hindered by the limited context length of multimodal models, leading to missed critical transition frames and poor temporal coherence. To address this, we propose VideoNSA, which adapts Native Sparse Attention (NSA) into a hardware-aware hybrid attention architecture: the text pathway retains dense attention, while the video pathway employs learnable sparse attention to enable dynamic attention aggregation and optimal global–local attention allocation. Built upon Qwen2.5-VL, our model is end-to-end trained on 216K video instruction samples and scales reliably to 128K tokens. On benchmarks for long-video understanding, temporal reasoning, and spatial localization, VideoNSA significantly outperforms token-compression methods and untrained sparse baselines. Results demonstrate that VideoNSA effectively balances fine-grained temporal modeling with scalable context length, achieving both high fidelity to motion and event dynamics and robust performance at extreme sequence lengths.
📝 Abstract
Video understanding in multimodal language models remains limited by context length: models often miss key transition frames and struggle to maintain coherence across long time scales. To address this, we adapt Native Sparse Attention (NSA) to video-language models. Our method, VideoNSA, adapts Qwen2.5-VL through end-to-end training on a 216K video instruction dataset. We adopt a hardware-aware hybrid approach to attention, preserving dense attention for text while applying NSA to video. Compared to token-compression and training-free sparse baselines, VideoNSA achieves improved performance on long-video understanding, temporal reasoning, and spatial benchmarks. Further ablation analysis reveals four key findings: (1) reliable scaling to 128K tokens; (2) an optimal global–local attention allocation at a fixed budget; (3) task-dependent branch usage patterns; and (4) the learnable combined sparse attention helps induce dynamic attention sinks.
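The hybrid routing idea — dense attention for text tokens, sparse attention for video tokens — can be illustrated with a toy NumPy sketch. This is a rough stand-in, not the paper's kernels: the `sparse_attention` function below approximates NSA's sliding-window and block-selection branches with a simple causal window plus top-scoring key blocks, and all function names, window/block sizes, and the routing logic in `hybrid_attention` are hypothetical simplifications for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def dense_attention(q, k, v):
    # Standard scaled dot-product attention over all keys (text pathway).
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def sparse_attention(q, k, v, window=4, top_blocks=1, block=4):
    # Toy sparse attention: each query attends to a causal sliding
    # window of recent keys plus the top-scoring key blocks -- a rough
    # analogue of NSA's sliding and selection branches.
    n, d = q.shape
    out = np.zeros_like(v)
    for i in range(n):
        mask = np.full(n, -np.inf)
        mask[max(0, i - window + 1):i + 1] = 0.0  # sliding window
        # Score whole key blocks by their mean key, keep the best ones.
        n_blocks = (i + block) // block  # blocks covering keys 0..i
        block_scores = [
            q[i] @ k[b * block:min((b + 1) * block, i + 1)].mean(axis=0)
            for b in range(n_blocks)
        ]
        for b in np.argsort(block_scores)[-top_blocks:]:
            mask[b * block:min((b + 1) * block, i + 1)] = 0.0
        scores = q[i] @ k.T / np.sqrt(d) + mask
        out[i] = softmax(scores) @ v
    return out

def hybrid_attention(q, k, v, is_video):
    # Route video tokens through sparse attention and text tokens
    # through dense attention over the full sequence.
    out = np.empty_like(v)
    vid = is_video
    if vid.any():
        out[vid] = sparse_attention(q[vid], k[vid], v[vid])
    if (~vid).any():
        out[~vid] = dense_attention(q[~vid], k, v)
    return out
```

In this sketch the text queries still see the full key set (dense), while the video pathway touches only a budgeted subset of keys per query, which is what makes the video branch's cost sub-quadratic in sequence length.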