$\text{PKS}^4$: Parallel Kinematic Selective State Space Scanners for Efficient Video Understanding

📅 2026-04-29
📈 Citations: 0
✨ Influential: 0

🤖 AI Summary
This work addresses the high computational cost of long-sequence modeling and the quadratic complexity of spatiotemporal attention in video understanding, as well as the limitations of existing efficient methods that either disrupt spatial structure or incur substantial memory overhead. To this end, the authors propose the PKS⁴ module, a plug-and-play, parallel temporal scanning mechanism inserted after standard 2D vision backbones. By integrating motion-prior-guided state space models (SSMs), PKS⁴ achieves efficient spatiotemporal modeling with linear time complexity while preserving full spatial semantics. The method substantially reduces training costs and accelerates convergence, attaining state-of-the-art performance on spatiotemporal action recognition benchmarks in just 20 training epochs with approximately one-tenth the computation of pure video SSM approaches.
📝 Abstract
Temporal modeling remains a fundamental challenge in video understanding, particularly as sequence lengths scale. Traditional video models relying on dense spatiotemporal attention suffer from quadratic computational costs for long videos. To circumvent these costs, recent approaches adapt image models for videos via Parameter-Efficient Fine-Tuning (PEFT) methods such as adapters. However, deeply inserting these modules incurs prohibitive activation memory overhead during back-propagation. While recent efficient State Space Models (SSMs) introduce linear complexity, they disrupt 2D spatial relationships and rely on extensive masked pre-training to recover spatial awareness. To overcome these limitations, we propose Parallel Kinematic Selective State Space Scanners (PKS$^4$). We retain a standard 2D vision backbone for spatial semantics and insert a single plug-and-play PKS$^4$ module with linear-complexity temporal scanning, avoiding temporal attention and multi-layer adapters. We first extract kinematic priors via a Kinematic Prior Encoder, which captures local displacements and motion boundaries through inter-frame correlations and differences. These priors drive linear-complexity SSMs to track underlying kinematic states, adaptively modulating update speeds and read-write strategies at each time step. Instead of global scanning, we deploy parallel scanners along the temporal dimension for each spatial location, preserving spatial structures while reducing overhead. Experiments on spatial-heavy and temporal-heavy action recognition benchmarks show that PKS$^4$ achieves state-of-the-art performance. Remarkably, our method converges in merely $20$ epochs, achieving approximately $10\times$ lower training compute than pure video SSMs, establishing a new paradigm for efficient video understanding.
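To make the pipeline described in the abstract concrete, the following is a minimal NumPy sketch of the idea, not the authors' implementation. It assumes backbone features of shape (T, H, W, C), reduces the Kinematic Prior Encoder to two hand-crafted cues per location (frame-difference magnitude and same-location cosine correlation with the previous frame), and uses a Mamba-style diagonal selective recurrence whose step size and read/write vectors are computed from those priors. The function names `kinematic_prior` and `parallel_temporal_scan` and the weights `w_delta`, `a_log`, `w_b`, `w_c` are all hypothetical.

```python
import numpy as np

def kinematic_prior(feats):
    """Toy motion cues from per-frame features of shape (T, H, W, C).

    Returns (T, H, W, 2): [frame-difference magnitude, cosine correlation
    with the previous frame at the same location]. This is an assumed
    stand-in for the paper's Kinematic Prior Encoder.
    """
    prev = np.concatenate([feats[:1], feats[:-1]], axis=0)  # frame t-1 (t=0 pairs with itself)
    diff = np.linalg.norm(feats - prev, axis=-1)
    num = (feats * prev).sum(axis=-1)
    den = np.linalg.norm(feats, axis=-1) * np.linalg.norm(prev, axis=-1) + 1e-8
    return np.stack([diff, num / den], axis=-1)

def softplus(x):
    return np.logaddexp(0.0, x)  # numerically stable log(1 + exp(x))

def parallel_temporal_scan(feats, prior, w_delta, a_log, w_b, w_c):
    """Run one independent selective scan over time at every (h, w) location.

    feats:   (T, H, W, C) backbone features
    prior:   (T, H, W, P) kinematic priors
    w_delta: (P, C)  prior -> per-channel step size (update speed)
    a_log:   (C, N)  state decay parameters, A = -exp(a_log)
    w_b:     (P, N)  prior -> write vector B_t
    w_c:     (P, N)  prior -> read vector C_t
    """
    T, H, W, C = feats.shape
    N = a_log.shape[1]
    A = -np.exp(a_log)                  # negative values give a stable decay
    h = np.zeros((H, W, C, N))          # one hidden state per location and channel
    out = np.empty_like(feats)
    for t in range(T):                  # cost grows linearly with T
        delta = softplus(prior[t] @ w_delta)       # (H, W, C)
        B = prior[t] @ w_b                         # (H, W, N)
        Cm = prior[t] @ w_c                        # (H, W, N)
        decay = np.exp(delta[..., None] * A)       # (H, W, C, N)
        h = decay * h + (delta * feats[t])[..., None] * B[..., None, :]
        out[t] = (h * Cm[..., None, :]).sum(axis=-1)
    return out
```

Because no operation mixes different (h, w) positions, the spatial layout produced by the 2D backbone passes through unchanged, and the loop over `t` is the only sequential dimension. A practical implementation would presumably replace the Python loop with a fused or associative scan and learn the projection weights; both are outside the scope of this sketch.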
Problem

Research questions and friction points this paper is trying to address.

Temporal modeling
Video understanding
Computational efficiency
State Space Models
Spatiotemporal attention
Innovation

Methods, ideas, or system contributions that make the work stand out.

State Space Models
Temporal Modeling
Kinematic Priors
Linear Complexity
Parallel Scanning