🤖 AI Summary
Video understanding is hindered by the limited context length of multimodal models, leading to missed critical transition frames and poor temporal coherence. To address this, we propose VideoNSA, which adapts Native Sparse Attention (NSA) into a hardware-aware hybrid attention architecture: the text pathway retains dense attention, while the video pathway employs learnable sparse attention to enable dynamic attention aggregation and optimal global–local attention allocation. Built upon Qwen2.5-VL, our model is end-to-end trained on 216K video instruction samples and scales reliably to 128K tokens. On benchmarks for long-video understanding, temporal reasoning, and spatial localization, VideoNSA significantly outperforms token-compression methods and untrained sparse baselines. Results demonstrate that VideoNSA effectively balances fine-grained temporal modeling with scalable context length, achieving both high fidelity to motion and event dynamics and robust performance at extreme sequence lengths.
📝 Abstract
Video understanding in multimodal language models remains limited by context length: models often miss key transition frames and struggle to maintain coherence across long time scales. To address this, we adapt Native Sparse Attention (NSA) to video-language models. Our method, VideoNSA, adapts Qwen2.5-VL through end-to-end training on a 216K video instruction dataset. We adopt a hardware-aware hybrid approach to attention, preserving dense attention for text while applying NSA to video. Compared to token-compression and training-free sparse baselines, VideoNSA achieves improved performance on long-video understanding, temporal reasoning, and spatial benchmarks. Further ablation analysis reveals four key findings: (1) reliable scaling to 128K tokens; (2) an optimal global–local attention allocation at a fixed budget; (3) task-dependent branch usage patterns; and (4) the learnable combined sparse attention helps induce dynamic attention sinks.
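The hybrid routing idea — dense attention for text tokens, sparse attention for video tokens — can be illustrated with a toy NumPy sketch. This is a rough stand-in, not the paper's kernels: the `sparse_attention` function below approximates NSA's sliding-window and block-selection branches with a simple causal window plus top-scoring key blocks, and all function names, window/block sizes, and the routing logic in `hybrid_attention` are hypothetical simplifications for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def dense_attention(q, k, v):
    # Standard scaled dot-product attention over all keys (text pathway).
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def sparse_attention(q, k, v, window=4, top_blocks=1, block=4):
    # Toy sparse attention: each query attends to a causal sliding
    # window of recent keys plus the top-scoring key blocks -- a rough
    # analogue of NSA's sliding and selection branches.
    n, d = q.shape
    out = np.zeros_like(v)
    for i in range(n):
        mask = np.full(n, -np.inf)
        mask[max(0, i - window + 1):i + 1] = 0.0  # sliding window
        # Score whole key blocks by their mean key, keep the best ones.
        n_blocks = (i + block) // block  # blocks covering keys 0..i
        block_scores = [
            q[i] @ k[b * block:min((b + 1) * block, i + 1)].mean(axis=0)
            for b in range(n_blocks)
        ]
        for b in np.argsort(block_scores)[-top_blocks:]:
            mask[b * block:min((b + 1) * block, i + 1)] = 0.0
        scores = q[i] @ k.T / np.sqrt(d) + mask
        out[i] = softmax(scores) @ v
    return out

def hybrid_attention(q, k, v, is_video):
    # Route video tokens through sparse attention and text tokens
    # through dense attention over the full sequence.
    out = np.empty_like(v)
    vid = is_video
    if vid.any():
        out[vid] = sparse_attention(q[vid], k[vid], v[vid])
    if (~vid).any():
        out[~vid] = dense_attention(q[~vid], k, v)
    return out
```

In this sketch the text queries still see the full key set (dense), while the video pathway touches only a budgeted subset of keys per query, which is what makes the video branch's cost sub-quadratic in sequence length.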