Spava: Accelerating Long-Video Understanding via Sequence-Parallelism-aware Approximate Attention

📅 2026-01-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the high computational overhead of the prefilling phase in long-video reasoning with large multimodal models (LMMs), where existing acceleration methods struggle to balance efficiency and performance. The authors propose Spava, the first multi-GPU inference framework that integrates sequence parallelism with distributed approximate attention, enabling highly efficient, near-lossless inference without compressing visual embeddings. Through a sequence-parallel architecture, load balancing, and fused forward computation, Spava overcomes the limitations of single-GPU sparse attention and embedding compression. On long-video understanding tasks, Spava achieves speedups of 12.72×, 1.70×, and 1.18× over FlashAttn, ZigZagRing, and APB, respectively, while incurring negligible performance degradation.

📝 Abstract
The efficiency of long-video inference remains a critical bottleneck, mainly due to the dense computation in the prefill stage of Large Multimodal Models (LMMs). Existing methods either compress visual embeddings or apply sparse attention on a single GPU, yielding limited acceleration or degraded performance and restricting LMMs from handling longer, more complex videos. To overcome these issues, we propose Spava, a sequence-parallel framework with optimized attention that accelerates long-video inference across multiple GPUs. By distributing approximate attention, Spava reduces computation and increases parallelism, enabling efficient processing of more visual embeddings without compression and thereby improving task performance. System-level optimizations, such as load balancing and fused forward passes, further unleash the potential of Spava, delivering speedups of 12.72x, 1.70x, and 1.18x over FlashAttn, ZigZagRing, and APB, without notable performance loss. Code available at https://github.com/thunlp/APB
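The abstract's core idea — splitting the token sequence across devices and letting each shard run approximate (sparse) attention over a subset of keys — can be illustrated with a minimal, self-contained sketch. This is not Spava's actual algorithm; the sharding scheme, the top-k key selection, and all function names here are simplifying assumptions for illustration only (real systems like Spava operate on GPU tensors with communication between ranks, load balancing, and fused kernels).

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def approx_attention_shard(q_shard, keys, values, top_k):
    """Attend only to the top_k highest-scoring keys per query --
    a stand-in for approximate/sparse attention (illustrative only)."""
    out = []
    for q in q_shard:
        scores = [sum(a * b for a, b in zip(q, k)) for k in keys]
        # keep only the top_k keys with the largest dot-product scores
        keep = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:top_k]
        w = softmax([scores[i] for i in keep])
        dim = len(values[0])
        out.append([sum(w[j] * values[keep[j]][d] for j in range(len(keep)))
                    for d in range(dim)])
    return out

def sequence_parallel_attention(queries, keys, values, num_ranks, top_k):
    """Split the query sequence across simulated ranks, run approximate
    attention per shard, then concatenate -- mimicking sequence parallelism."""
    shard = math.ceil(len(queries) / num_ranks)
    output = []
    for r in range(num_ranks):
        q_shard = queries[r * shard:(r + 1) * shard]
        output.extend(approx_attention_shard(q_shard, keys, values, top_k))
    return output
```

Each simulated rank touches only its slice of queries and only `top_k` keys per query, which is where both the parallelism and the computation savings come from in the real system.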
Problem

Research questions and friction points this paper is trying to address.

long-video inference
Large Multimodal Models
attention computation
sequence parallelism
visual embeddings
Innovation

Methods, ideas, or system contributions that make the work stand out.

sequence-parallelism
approximate attention
long-video understanding
multi-GPU acceleration
Large Multimodal Models