Towards Scalable Modeling of Compressed Videos for Efficient Action Recognition

📅 2025-03-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the high decoding overhead, weak temporal modeling, and underutilization of explicit modality sparsity in compressed-video action recognition, this paper proposes the first end-to-end learning framework operating directly on I-frames, P-frame motion vectors, and residuals. Methodologically, it introduces a triple-decoupled architecture: (1) a dual-stream spiking encoder for energy-efficient temporal modeling; (2) global self-attention for cross-modal fusion to enhance spatiotemporal consistency; and (3) a multimodal token-mixing block for joint embedding of heterogeneous compressed-domain features. Evaluated on five major benchmarks—from UCF-101 to Something-Something v2—the framework achieves state-of-the-art accuracy. It improves inference speed by 56×, reduces computational cost by 330×, consumes only 0.73 J per video, and attains a throughput of 16 videos per second.
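The paper does not spell out its Spiking Temporal Modulators here, but the energy argument behind spiking encoders can be illustrated with a standard leaky integrate-and-fire (LIF) neuron: because P-frame motion vectors are mostly zero, the membrane potential rarely crosses threshold, so the encoder emits sparse spike trains and most timesteps cost almost nothing. The function below is a minimal sketch under that assumption; the shapes, `tau`, and `threshold` values are illustrative, not the paper's.

```python
import numpy as np

def lif_spikes(inputs, tau=0.5, threshold=1.0):
    """Leaky integrate-and-fire neuron over a temporal feature sequence.

    inputs: array of shape (T, D), e.g. per-P-frame motion-vector features.
    Returns binary spike trains of shape (T, D); the membrane potential
    decays by `tau` each step and hard-resets wherever a spike fires.
    """
    v = np.zeros(inputs.shape[1])
    spikes = np.zeros_like(inputs)
    for t, x in enumerate(inputs):
        v = tau * v + x                 # leaky integration of input current
        fired = v >= threshold          # spike where potential crosses threshold
        spikes[t] = fired.astype(float)
        v = np.where(fired, 0.0, v)     # hard reset after firing
    return spikes

# Sparse inputs (mostly-zero motion vectors) yield sparse spike trains,
# which is where the energy savings of spiking encoders come from.
x = np.zeros((8, 4))
x[2] = 1.5                              # a single burst of motion
s = lif_spikes(x)                       # spikes only at t == 2
```

In a full model, downstream layers would only need to do work where spikes occur, which is the intuition behind the reported per-video energy figure.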

📝 Abstract
Training robust deep video representations has proven computationally challenging due to substantial decoding overheads, the enormous size of raw video streams, and their inherent high temporal redundancy. Unlike existing schemes, operating exclusively in the compressed video domain and exploiting all freely available modalities, i.e., I-frames and P-frames (motion vectors and residuals), offers a compute-efficient alternative. Existing methods treat this task as a naive multi-modality problem, ignoring the temporal correlation and implicit sparsity across P-frames that could be used to model stronger shared representations for videos of the same action, making training and generalization easier. By revisiting the high-level design of dominant video understanding backbones, we increase inference speed by a factor of $56$ while retaining similar performance. To this end, we propose a hybrid end-to-end framework that factorizes learning across three key concepts, reducing inference cost by $330\times$ versus prior art: first, a specially designed dual-encoder scheme with efficient Spiking Temporal Modulators that minimizes latency while retaining cross-domain feature aggregation; second, a unified transformer model that captures inter-modal dependencies using global self-attention to enhance I-frame--P-frame contextual interactions; third, a Multi-Modal Mixer Block that models rich representations from the joint spatiotemporal token embeddings. Experiments show that our method yields a lightweight architecture achieving state-of-the-art video recognition performance on the UCF-101, HMDB-51, K-400, K-600, and SS-v2 datasets with favorable cost ($0.73$ J/video) and fast inference ($16$ videos/s). Our observations bring new insights into practical design choices for efficient next-generation spatiotemporal learners. Code is available.
Problem

Research questions and friction points this paper is trying to address.

High computational overhead of decoding and processing raw video for action recognition.
Underexploited compressed-video modalities (I-frames, motion vectors, residuals) as a cheap training signal.
Ignored temporal correlation and implicit sparsity across P-frames in existing multi-modal approaches.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dual-encoder scheme with Spiking Temporal Modulators
Unified transformer model for inter-modal dependencies
Multi-Modal Mixer Block for spatiotemporal token embeddings
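The three contributions above could be wired together roughly as follows: per-modality tokens (from the dual encoders) are concatenated, fused by global self-attention across modalities, then passed through an MLP-Mixer-style block that mixes information across tokens. This is a minimal numpy sketch with random weights and illustrative dimensions; it is not the paper's actual architecture, and all names and sizes are placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens, d_k=16):
    """Single-head global self-attention over concatenated modality tokens."""
    N, D = tokens.shape
    Wq, Wk, Wv = (rng.standard_normal((D, d_k)) / np.sqrt(D) for _ in range(3))
    Q, K, V = tokens @ Wq, tokens @ Wk, tokens @ Wv
    attn = softmax(Q @ K.T / np.sqrt(d_k))      # every token attends to every
    return attn @ V                              # token, across all modalities

def token_mixer(tokens, hidden=32):
    """MLP-Mixer-style block: a 2-layer MLP applied across the token axis."""
    N, D = tokens.shape
    W1 = rng.standard_normal((N, hidden)) / np.sqrt(N)
    W2 = rng.standard_normal((hidden, N)) / np.sqrt(hidden)
    mix = (np.maximum(tokens.T @ W1, 0) @ W2).T  # mix along the token dimension
    return tokens + mix                          # residual connection

# Hypothetical per-modality token embeddings (4 tokens x 24 channels each):
i_tokens   = rng.standard_normal((4, 24))   # I-frame appearance tokens
mv_tokens  = rng.standard_normal((4, 24))   # P-frame motion-vector tokens
res_tokens = rng.standard_normal((4, 24))   # P-frame residual tokens

tokens = np.concatenate([i_tokens, mv_tokens, res_tokens])  # (12, 24)
fused  = self_attention(tokens)              # cross-modal fusion -> (12, 16)
out    = token_mixer(fused)                  # joint token mixing -> (12, 16)
video_repr = out.mean(axis=0)                # pooled video representation
```

The ordering (attention for cross-modal fusion, then token mixing over the joint embeddings) follows the description in the summary; in the real model each stage would of course carry learned weights, normalization, and many more blocks.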