Spava: Accelerating Long-Video Understanding via Sequence-Parallelism-aware Approximate Attention

📅 2026-01-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the high computational overhead of the prefilling phase in long-video reasoning with large multimodal models (LMMs), where existing acceleration methods struggle to balance efficiency and performance. The authors propose Spava, the first multi-GPU inference framework that integrates sequence parallelism with distributed approximate attention, enabling highly efficient, near-lossless inference without compressing visual embeddings. Through a sequence-parallel architecture, load balancing, and fused forward computation, Spava overcomes the limitations of single-GPU sparse attention and embedding compression. On long-video understanding tasks, Spava achieves speedups of 12.72×, 1.70×, and 1.18× over FlashAttn, ZigZagRing, and APB, respectively, while incurring negligible performance degradation.

📝 Abstract
The efficiency of long-video inference remains a critical bottleneck, mainly due to the dense computation in the prefill stage of Large Multimodal Models (LMMs). Existing methods either compress visual embeddings or apply sparse attention on a single GPU, yielding limited acceleration or degraded performance and restricting LMMs from handling longer, more complex videos. To overcome these issues, we propose Spava, a sequence-parallel framework with optimized attention that accelerates long-video inference across multiple GPUs. By distributing approximate attention, Spava reduces computation and increases parallelism, enabling efficient processing of more visual embeddings without compression and thereby improving task performance. System-level optimizations, such as load balancing and fused forward passes, further unleash the potential of Spava, delivering speedups of 12.72x, 1.70x, and 1.18x over FlashAttn, ZigZagRing, and APB, without notable performance loss. Code available at https://github.com/thunlp/APB
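The abstract's core idea — splitting the token sequence across devices and letting each shard run approximate (sparse) attention over a subset of keys — can be illustrated with a minimal, self-contained sketch. This is not Spava's actual algorithm; the sharding scheme, the top-k key selection, and all function names here are simplifying assumptions for illustration only (real systems like Spava operate on GPU tensors with communication between ranks, load balancing, and fused kernels).

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def approx_attention_shard(q_shard, keys, values, top_k):
    """Attend only to the top_k highest-scoring keys per query --
    a stand-in for approximate/sparse attention (illustrative only)."""
    out = []
    for q in q_shard:
        scores = [sum(a * b for a, b in zip(q, k)) for k in keys]
        # keep only the top_k keys with the largest dot-product scores
        keep = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:top_k]
        w = softmax([scores[i] for i in keep])
        dim = len(values[0])
        out.append([sum(w[j] * values[keep[j]][d] for j in range(len(keep)))
                    for d in range(dim)])
    return out

def sequence_parallel_attention(queries, keys, values, num_ranks, top_k):
    """Split the query sequence across simulated ranks, run approximate
    attention per shard, then concatenate -- mimicking sequence parallelism."""
    shard = math.ceil(len(queries) / num_ranks)
    output = []
    for r in range(num_ranks):
        q_shard = queries[r * shard:(r + 1) * shard]
        output.extend(approx_attention_shard(q_shard, keys, values, top_k))
    return output
```

Each simulated rank touches only its slice of queries and only `top_k` keys per query, which is where both the parallelism and the computation savings come from in the real system.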
Problem

Research questions and friction points this paper is trying to address.

long-video inference
Large Multimodal Models
attention computation
sequence parallelism
visual embeddings
Innovation

Methods, ideas, or system contributions that make the work stand out.

sequence-parallelism
approximate attention
long-video understanding
multi-GPU acceleration
Large Multimodal Models