AdaSpark: Adaptive Sparsity for Efficient Long-Video Understanding

📅 2026-04-09
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the high computational cost of processing long videos in video large language models, where existing sparse approaches either suffer from irreversible information loss that impairs fine-grained perception or rely on fixed sparsification patterns that hinder long-range temporal modeling. To overcome these limitations, the authors propose AdaSpark, an adaptive sparsity framework that partitions input videos into 3D spatio-temporal cubes and introduces context-aware Adaptive Cube-Selective Attention (AdaS-Attn) and an Adaptive Token-Selective FFN (AdaS-FFN). These components, coupled with an entropy-based Top-p mechanism, dynamically select salient cubes and tokens at inference time. Evaluated on hour-long video benchmarks, AdaSpark reduces FLOPs by up to 57% while matching dense-model performance and retaining the ability to capture fine-grained, long-range dependencies.
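The cube partitioning step described above amounts to a tensor reshape over the video token grid. Below is a minimal PyTorch sketch under stated assumptions: the `(T, H, W, D)` layout, the function name, and the default cube sizes are illustrative choices, not the paper's exact configuration.

```python
import torch

def partition_into_cubes(video_tokens, t_cube=4, h_cube=4, w_cube=4):
    """Split a (T, H, W, D) video token grid into 3D spatio-temporal cubes.

    Returns a (num_cubes, t_cube * h_cube * w_cube, D) tensor with one
    row of flattened tokens per cube. Cube sizes are illustrative.
    """
    T, H, W, D = video_tokens.shape
    assert T % t_cube == 0 and H % h_cube == 0 and W % w_cube == 0
    x = video_tokens.reshape(
        T // t_cube, t_cube, H // h_cube, h_cube, W // w_cube, w_cube, D
    )
    # Bring the three cube-index axes to the front, then flatten each
    # cube's t_cube * h_cube * w_cube tokens into a single row.
    x = x.permute(0, 2, 4, 1, 3, 5, 6)
    return x.reshape(-1, t_cube * h_cube * w_cube, D)

# e.g. a 16x16x16 grid of 256-dim tokens -> 64 cubes of 64 tokens each
cubes = partition_into_cubes(torch.randn(16, 16, 16, 256))  # (64, 64, 256)
```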
📝 Abstract
Processing long-form videos with Video Large Language Models (Video-LLMs) is computationally prohibitive. Current efficiency methods often compromise fine-grained perception through irreversible information disposal or inhibit long-range temporal modeling via rigid, predefined sparse patterns. This paper introduces AdaSpark, an adaptive sparsity framework designed to address these limitations. AdaSpark first partitions video inputs into 3D spatio-temporal cubes. It then employs two co-designed, context-aware components: (1) Adaptive Cube-Selective Attention (AdaS-Attn), which adaptively selects a subset of relevant video cubes to attend to for each query token, and (2) Adaptive Token-Selective FFN (AdaS-FFN), which selectively processes only the most salient tokens within each cube. An entropy-based (Top-p) selection mechanism adaptively allocates computational resources based on input complexity. Experiments on challenging hour-scale video benchmarks demonstrate that AdaSpark reduces FLOPs by up to 57% while maintaining performance comparable to dense models and preserving fine-grained, long-range dependencies.
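The entropy-based (Top-p) selection suggests a nucleus-style cutoff: keep the smallest set of candidates whose normalized relevance mass reaches p, so peaked (low-entropy) score distributions keep few cubes or tokens while flat (high-entropy) ones keep many. A minimal sketch of such a selector follows; the softmax normalization and the function name are assumptions, not the paper's exact formulation.

```python
import torch

def top_p_select(scores, p=0.9):
    """Keep the smallest set of candidates whose softmax mass reaches p.

    scores: (..., N) relevance scores, e.g. query-to-cube scores for
    cube-selective attention or per-token saliency for a selective FFN
    (an assumed usage, sketching the Top-p idea).
    Returns a boolean keep-mask of the same shape, in original order.
    """
    probs = torch.softmax(scores, dim=-1)
    sorted_probs, order = torch.sort(probs, descending=True, dim=-1)
    cum = torch.cumsum(sorted_probs, dim=-1)
    # An item survives if the cumulative mass *before* it is still below
    # p, so the top-1 item is always kept; flat (high-entropy) score
    # distributions keep many items, peaked (low-entropy) ones keep few.
    keep_sorted = (cum - sorted_probs) < p
    mask = torch.zeros_like(keep_sorted)
    mask.scatter_(-1, order, keep_sorted)
    return mask
```

Under this sketch, sparse attention would restrict each query to the cubes where the mask is True, and the FFN would run only on the kept tokens; because the mask size varies with the score distribution of each input, the compute budget is adaptive rather than fixed.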
Problem

Research questions and friction points this paper is trying to address.

long-video understanding
computational efficiency
sparsity
fine-grained perception
temporal modeling
Innovation

Methods, ideas, or system contributions that make the work stand out.

adaptive sparsity
video understanding
long-video modeling
computational efficiency
context-aware selection
Handong Li
Institute of Automation, Chinese Academy of Sciences
cross-modality pretraining
Zikang Liu
Institute of Automation, Chinese Academy of Sciences
Multimodal Learning, Computer Vision
Longteng Guo
School of Artificial Intelligence, University of Chinese Academy of Sciences; Institute of Automation, Chinese Academy of Sciences
Tongtian Yue
Institute of Automation, Chinese Academy of Sciences
Multimodal Pretrain, Visual-Language
Yepeng Tang
Beijing Jiaotong University
VideoLLM, Video Understanding
Xinxin Zhu
School of Artificial Intelligence, University of Chinese Academy of Sciences; Institute of Automation, Chinese Academy of Sciences
Chuanyang Zheng
Alibaba Group Holding Limited; Future Living Lab of Alibaba
Ziming Wang
Alibaba Group Holding Limited; Future Living Lab of Alibaba
Zhibin Wang
Zhejiang University
new particle formation, aerosols, hygroscopicity, black carbon
Jun Song
Shenzhen University
nanophotonics
Cheng Yu
PhD student, CSE, Ohio State University
audiovisual, speech enhancement, speech separation, online system, deep learning
Bo Zheng
Researcher, Alibaba Group
AI, Network, E-Commerce
Jing Liu
School of Artificial Intelligence, University of Chinese Academy of Sciences; Institute of Automation, Chinese Academy of Sciences