MVP-Shot: Multi-Velocity Progressive-Alignment Framework for Few-Shot Action Recognition

📅 2024-05-03
🏛️ arXiv.org
📈 Citations: 3
✨ Influential: 0

🤖 AI Summary
In few-shot action recognition, single-scale feature alignment (e.g., frame-level or segment-level) overlooks the fact that instances of the same action class can unfold at different speeds. To address this, the paper proposes the Multi-Velocity Progressive-alignment (MVP-Shot) framework, which learns and aligns semantic-related action features at multiple velocity levels. The approach has two complementary modules: (1) a Multi-Velocity Feature Alignment (MVFA) module that measures support-query similarity at several velocity scales and merges the resulting scores in a residual fashion; and (2) a Progressive Semantic-Tailored Interaction (PSTI) module that injects velocity-tailored text information into the video features through interaction on the channel and temporal domains, keeping the multi-velocity features close to the underlying motion semantics. Experiments report state-of-the-art results on HMDB51, UCF101, Kinetics, and SSv2-small.

πŸ“ Abstract
Recent few-shot action recognition (FSAR) methods typically perform semantic matching on learned discriminative features to achieve promising performance. However, most FSAR methods focus on single-scale (e.g., frame-level or segment-level) feature alignment, which ignores that human actions with the same semantics may appear at different velocities. To this end, we develop a novel Multi-Velocity Progressive-alignment (MVP-Shot) framework to progressively learn and align semantic-related action features at multi-velocity levels. Concretely, a Multi-Velocity Feature Alignment (MVFA) module is designed to measure the similarity between features from support and query videos at different velocity scales and then merge all similarity scores in a residual fashion. To prevent the multi-velocity features from deviating from the underlying motion semantics, our proposed Progressive Semantic-Tailored Interaction (PSTI) module injects velocity-tailored text information into the video features via feature interaction on the channel and temporal domains at different velocities. The two modules complement each other to make more accurate query sample predictions under few-shot settings. Experimental results show that our method outperforms current state-of-the-art methods on multiple standard few-shot benchmarks (i.e., HMDB51, UCF101, Kinetics, and SSv2-small).
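The MVFA idea described in the abstract (compare support and query features at several velocity scales, then merge the scores residually) can be illustrated with a small sketch. This is not the authors' implementation. The pooling scheme (non-overlapping average pooling with a given stride), the aligned cosine matching, the `strides` values, and the `alpha` weight are all assumptions chosen for illustration, and the function names are hypothetical.

```python
import math


def pool_frames(frames, stride):
    """Average every `stride` consecutive frame features into one snippet feature.

    Assumption: a larger stride stands in for a coarser velocity scale.
    """
    snippets = []
    for start in range(0, len(frames) - stride + 1, stride):
        window = frames[start:start + stride]
        snippets.append([sum(col) / stride for col in zip(*window)])
    return snippets


def cosine(u, v):
    """Cosine similarity between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def scale_similarity(support, query, stride):
    """Mean cosine similarity between temporally aligned snippets at one scale.

    Assumption: both videos have the same number of frames.
    """
    s = pool_frames(support, stride)
    q = pool_frames(query, stride)
    return sum(cosine(a, b) for a, b in zip(s, q)) / len(s)


def multi_velocity_score(support, query, strides=(1, 2, 4), alpha=0.5):
    """Start from the finest-scale score and add coarser scales as residual terms."""
    score = scale_similarity(support, query, strides[0])
    for stride in strides[1:]:
        score += alpha * scale_similarity(support, query, stride)
    return score


# Toy example: 4 frames with 2-dimensional features each.
video = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, 0.0]]
same_video_score = multi_velocity_score(video, video)
```

In a few-shot episode, a query video would be assigned to the support class with the highest merged score. The residual form means the coarser scales adjust the frame-level score rather than replace it.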
Problem

Research questions and friction points this paper is trying to address.

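Single-scale feature alignment in few-shot action recognition ignores that the same action can occur at different velocities
Multi-velocity action features can drift away from the underlying motion semantics
Accuracy on standard few-shot benchmarks (HMDB51, UCF101, Kinetics, SSv2-small)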
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-Velocity Feature Alignment for diverse speeds
Progressive Semantic-Tailored Interaction with text
Residual merging of multi-velocity similarity scores
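The text-guided interaction on the channel and temporal domains can be pictured with a rough sketch. The page does not specify how PSTI performs this interaction, so everything below is an assumption: a sigmoid gate per channel and a softmax over frame-text dot products are generic stand-ins, and the function names are hypothetical.

```python
import math


def channel_interaction(frames, text):
    """Scale each channel of every frame by a sigmoid gate from the text embedding.

    Assumption: the text embedding has the same dimension as a frame feature.
    """
    gates = [1.0 / (1.0 + math.exp(-t)) for t in text]
    return [[x * g for x, g in zip(frame, gates)] for frame in frames]


def temporal_interaction(frames, text):
    """Reweight frames by a softmax over frame-text dot products.

    Assumption: a residual form, so each frame keeps its original feature
    and gains a copy scaled by its text-relevance weight.
    """
    logits = [sum(x * t for x, t in zip(frame, text)) for frame in frames]
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    weights = [e / total for e in exps]
    return [[x + w * x for x in frame] for frame, w in zip(frames, weights)]


# Toy example: a neutral (all-zero) text embedding gives 0.5 gates
# and uniform temporal weights.
frames = [[2.0, 4.0], [6.0, 8.0]]
neutral_text = [0.0, 0.0]
gated = channel_interaction(frames, neutral_text)
reweighted = temporal_interaction(frames, neutral_text)
```

In this sketch the same two functions would be applied to the features at each velocity scale, with a text embedding tailored to that scale. How the paper derives those velocity-tailored embeddings is not stated on this page.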
Hongyu Qu
School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing 210094, China
Rui Yan
Department of Computer Science and Technology, Nanjing University, Nanjing 210023, China
Xiangbo Shu
School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing 210094, China
Haoliang Gao
School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing 210094, China
Peng Huang
School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing 210094, China
Guo-Sen Xie
Professor, Nanjing University of Science and Technology
Computer Vision, Machine Learning