SOAP: Enhancing Spatio-Temporal Relation and Motion Information Capturing for Few-Shot Action Recognition

📅 2024-07-23
🏛️ ACM Multimedia
📈 Citations: 6
Influential: 0
🤖 AI Summary
To address the challenges of weak spatio-temporal relations and insufficient motion information density in few-shot action recognition (FSAR) on high-frame-rate (HFR) videos, this paper proposes the Spatio-tempOral frAme tuPle enhancer (SOAP), a plug-and-play architecture; the model built with it is called SOAP-Net. First, frame tuples spanning multiple frames capture denser motion information than adjacent-frame pairs, and combining tuples of diverse frame counts provides a broader temporal perspective. Second, temporal connections across feature channels and the spatio-temporal relation of features are modeled jointly rather than extracted separately. SOAP thereby overcomes two key limitations of conventional FSAR: (1) the decoupling of spatial and temporal features, and (2) reliance on motion cues from adjacent frames alone. Extensive experiments demonstrate state-of-the-art performance on SthSthV2, Kinetics, UCF101, and HMDB51, along with strong competitiveness, pluggability, generalization, and robustness.

📝 Abstract
High frame-rate (HFR) videos for action recognition improve fine-grained expression while reducing spatio-temporal relation and motion information density. Thus, large numbers of video samples are continuously required for traditional data-driven training. However, samples are not always sufficient in real-world scenarios, motivating few-shot action recognition (FSAR) research. We observe that most recent FSAR works build the spatio-temporal relation of video samples via temporal alignment after spatial feature extraction, cutting spatial and temporal features apart within samples. They also capture motion information from the narrow perspective of adjacent frames without considering density, leading to insufficient motion information capturing. Therefore, we propose a novel plug-and-play architecture for FSAR called the Spatio-tempOral frAme tuPle enhancer (SOAP); the model we design with this architecture is referred to as SOAP-Net. Instead of simple feature extraction, it considers temporal connections between different feature channels and the spatio-temporal relation of features. Comprehensive motion information is also captured using frame tuples, since tuples of multiple frames contain more motion information than adjacent-frame pairs; combining frame tuples of diverse frame counts further provides a broader perspective. SOAP-Net achieves new state-of-the-art performance on well-known benchmarks such as SthSthV2, Kinetics, UCF101, and HMDB51. Extensive empirical evaluations underscore the competitiveness, pluggability, generalization, and robustness of SOAP. The code is released at https://github.com/wenbohuang1002/SOAP.
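The frame-tuple idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation (see their repository for that); it only shows, under assumed NumPy conventions, how tuples of several consecutive frames carry denser motion cues than adjacent-frame pairs, and how tuples of diverse sizes can be combined. The function names `make_frame_tuples` and `tuple_motion` are hypothetical:

```python
import numpy as np

def make_frame_tuples(frames, tuple_size, stride=1):
    """Slide a window of `tuple_size` consecutive frames over the clip.

    Each tuple spans a wider temporal context than a single adjacent-frame
    pair, so differences within it carry denser motion information.
    frames: array of shape (T, H, W, C); returns (N, tuple_size, H, W, C).
    """
    T = frames.shape[0]
    return np.stack([frames[t:t + tuple_size]
                     for t in range(0, T - tuple_size + 1, stride)])

def tuple_motion(frames, tuple_sizes=(2, 3, 5)):
    """Combine motion cues from frame tuples of diverse frame counts.

    For each tuple size, motion is approximated here by the mean absolute
    difference between consecutive frames inside the tuple (a stand-in for
    the paper's learned motion encoding); one statistic per tuple size.
    """
    feats = []
    for k in tuple_sizes:
        tuples = make_frame_tuples(frames, k)    # (N, k, H, W, C)
        diffs = np.abs(np.diff(tuples, axis=1))  # (N, k-1, H, W, C)
        feats.append(diffs.mean())               # scalar motion energy
    return np.array(feats)

# toy clip: 8 frames of 4x4 grayscale with a moving bright pixel
clip = np.zeros((8, 4, 4, 1), dtype=np.float32)
for t in range(8):
    clip[t, t % 4, t % 4, 0] = 1.0

print(tuple_motion(clip))  # one motion statistic per tuple size
```

A tuple size of 2 reduces to the adjacent-frame differencing the paper criticizes; larger tuples aggregate motion over a wider window, and stacking the per-size statistics gives the broader multi-scale perspective the abstract describes.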
Problem

Research questions and friction points this paper is trying to address.

Few-shot action recognition struggles with the limited availability of video samples
Existing methods separate spatial and temporal features during processing
Current approaches capture motion only between adjacent frames, yielding insufficient motion information
Innovation

Methods, ideas, or system contributions that make the work stand out.

Enhances spatio-temporal relations via frame tuples
Captures motion using multiple frames comprehensively
Pluggable architecture for few-shot action recognition