🤖 AI Summary
This work addresses the high computational cost of processing long videos in video large language models, where existing sparse approaches either suffer from irreversible information loss that impairs fine-grained perception or rely on fixed sparsification patterns that hinder long-range temporal modeling. To overcome these limitations, the authors propose AdaSpark, an adaptive sparsity framework that partitions input videos into 3D spatiotemporal cubes and introduces context-aware adaptive attention (AdaS-Attn) and feed-forward networks (AdaS-FFN). These components, coupled with an entropy-based Top-p mechanism, dynamically select salient regions and tokens during inference. Evaluated on hour-long video benchmarks, AdaSpark reduces FLOPs by up to 57% while preserving performance comparable to dense models and maintaining the ability to capture fine-grained long-range dependencies.
📝 Abstract
Processing long-form videos with Video Large Language Models (Video-LLMs) is computationally prohibitive. Current efficiency methods often compromise fine-grained perception through irreversible information disposal or inhibit long-range temporal modeling via rigid, predefined sparse patterns. This paper introduces AdaSpark, an adaptive sparsity framework designed to address these limitations. AdaSpark first partitions video inputs into 3D spatio-temporal cubes. It then employs two co-designed, context-aware components: (1) Adaptive Cube-Selective Attention (AdaS-Attn), which adaptively selects a subset of relevant video cubes for each query token to attend to, and (2) Adaptive Token-Selective FFN (AdaS-FFN), which selectively processes only the most salient tokens within each cube. An entropy-based (Top-p) selection mechanism adaptively allocates computational resources based on input complexity. Experiments demonstrate that AdaSpark reduces FLOPs by up to 57% while maintaining performance comparable to dense models and preserving fine-grained, long-range dependencies, as validated on challenging hour-scale video benchmarks.
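The abstract does not spell out how the Top-p selection is computed, so the following is only a minimal sketch of one common reading: softmax-normalize a query's relevance scores over the cubes, then keep the smallest set of cubes whose cumulative probability reaches a threshold `p`. The function names, the scores, and the value `p=0.9` are hypothetical and are not taken from the paper. The sketch shows why such a rule adapts to input complexity: a peaked (low-entropy) score distribution keeps few cubes, while a flat (high-entropy) one keeps many.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy (in nats) of a probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def top_p_select(scores, p=0.9):
    """Return (kept_indices, probs): the smallest set of cube indices
    whose cumulative softmax probability reaches p.

    Hypothetical helper for illustration only, not the paper's code.
    """
    probs = softmax(scores)
    # Visit cubes from most to least relevant.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    return sorted(kept), probs


# Assumed query-to-cube relevance scores for 8 cubes (made-up numbers).
peaked_scores = [8.0, 0.1, 0.0, 0.2, 0.1, 0.0, 0.1, 0.0]   # one dominant cube
flat_scores = [1.0, 1.1, 0.9, 1.0, 1.05, 0.95, 1.0, 1.0]   # no clear winner

peaked_kept, peaked_probs = top_p_select(peaked_scores)
flat_kept, flat_probs = top_p_select(flat_scores)

print("peaked:", len(peaked_kept), "cubes kept, entropy", round(entropy(peaked_probs), 3))
print("flat:  ", len(flat_kept), "cubes kept, entropy", round(entropy(flat_probs), 3))
```

Under these assumptions, the budget is never fixed in advance: the number of cubes kept follows the shape of the score distribution. A fixed Top-k rule would instead spend the same compute on both queries.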