MVP-Shot: Multi-Velocity Progressive-Alignment Framework for Few-Shot Action Recognition

📅 2024-05-03

🏛️ arXiv.org

📈 Citations: 3

✨ Influential: 0

🤖 AI Summary

In few-shot action recognition, single-scale feature alignment (e.g., frame- or segment-level) overlooks the fact that instances of the same action can be performed at different velocities. To address this, the paper proposes the Multi-Velocity Progressive-alignment (MVP-Shot) framework, which progressively learns and aligns semantic-related action features at multiple velocity levels. The approach comprises two complementary modules: (1) a Multi-Velocity Feature Alignment (MVFA) module that measures the similarity between support and query video features at different velocity scales and merges the similarity scores in a residual fashion; and (2) a Progressive Semantic-Tailored Interaction (PSTI) module that injects velocity-tailored text information into the video features through interaction on the channel and temporal domains. The authors report that MVP-Shot outperforms current state-of-the-art methods on HMDB51, UCF101, Kinetics, and SSv2-small.

📝 Abstract

Recent few-shot action recognition (FSAR) methods typically perform semantic matching on learned discriminative features to achieve promising performance. However, most FSAR methods focus on single-scale (e.g., frame-level or segment-level) feature alignment, which ignores that human actions with the same semantics may appear at different velocities. To this end, we develop a novel Multi-Velocity Progressive-alignment (MVP-Shot) framework to progressively learn and align semantic-related action features at multi-velocity levels. Concretely, a Multi-Velocity Feature Alignment (MVFA) module is designed to measure the similarity between features from support and query videos at different velocity scales and then merge all similarity scores in a residual fashion. To keep the multi-velocity features from deviating from the underlying motion semantics, our proposed Progressive Semantic-Tailored Interaction (PSTI) module injects velocity-tailored text information into the video feature via feature interaction on the channel and temporal domains at different velocities. The two modules complement each other to make more accurate query sample predictions under few-shot settings. Experimental results show that our method outperforms current state-of-the-art methods on multiple standard few-shot benchmarks (i.e., HMDB51, UCF101, Kinetics, and SSv2-small).
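
The multi-velocity matching idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify how velocity scales are built or how scores are merged, so the sketch assumes that a velocity scale is formed by average-pooling non-overlapping windows of frame features, that similarity is the mean cosine similarity of temporally aligned features, and that "residual" merging means each scale's score is added onto a running total. The function names `pool_velocity`, `scale_similarity`, and `multi_velocity_score` and the window sizes are hypothetical.

```python
import math


def pool_velocity(frames, window):
    """Average-pool non-overlapping windows of frame features.

    Assumption: a larger window stands in for a coarser velocity scale.
    """
    if window < 1 or len(frames) < window:
        raise ValueError("window must be between 1 and the number of frames")
    pooled = []
    for start in range(0, len(frames) - window + 1, window):
        chunk = frames[start:start + window]
        pooled.append([sum(col) / window for col in zip(*chunk)])
    return pooled


def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def scale_similarity(support, query):
    """Mean cosine similarity over temporally aligned feature pairs."""
    pairs = list(zip(support, query))
    return sum(cosine(s, q) for s, q in pairs) / len(pairs)


def multi_velocity_score(support, query, windows=(1, 2, 4)):
    """Accumulate per-scale similarities onto a running score.

    Assumption: this additive accumulation is one plausible reading of
    'merging similarity scores in a residual fashion'.
    """
    score = 0.0
    for window in windows:
        score = score + scale_similarity(
            pool_velocity(support, window), pool_velocity(query, window)
        )
    return score


if __name__ == "__main__":
    support = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, 0.0]]
    query = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
    print(multi_velocity_score(support, support))
    print(multi_velocity_score(support, query))
```

In a few-shot episode, a query video would be scored this way against each support class and assigned to the class with the highest merged score. The sketch assumes support and query videos have the same number of frames.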

Problem

Research questions and friction points this paper is trying to address.

Single-scale feature alignment in few-shot action recognition ignores velocity differences within the same action

Aligns multi-velocity action features progressively

Improves accuracy in few-shot benchmarks

Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-Velocity Feature Alignment for diverse speeds

Progressive Semantic-Tailored Interaction with text

Residual merging of multi-velocity similarity scores
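
As a rough illustration of text-guided interaction on the channel and temporal domains, the sketch below injects a text embedding into a sequence of video features. It is a sketch under stated assumptions, not the PSTI module itself: the source does not describe the interaction operators, so the sigmoid channel gate, the softmax temporal weighting, and the residual addition are all illustrative choices, and the name `inject_text` is hypothetical. The text embedding is assumed to have the same dimensionality as each frame feature.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def inject_text(frames, text):
    """Fuse a text embedding into per-frame video features.

    Assumed design (not from the paper):
      1. Channel domain: scale each channel by sigmoid(text[channel]).
      2. Temporal domain: weight each time step by a softmax over the
         dot product between the gated frame and the text embedding.
      3. Add the weighted, gated feature back onto the original frame.
    """
    gate = [1.0 / (1.0 + math.exp(-t)) for t in text]
    gated = [[x * g for x, g in zip(frame, gate)] for frame in frames]
    logits = [sum(x * t for x, t in zip(frame, text)) for frame in gated]
    weights = softmax(logits)
    return [
        [x + w * gx for x, gx in zip(frame, gated_frame)]
        for frame, gated_frame, w in zip(frames, gated, weights)
    ]


if __name__ == "__main__":
    frames = [[2.0, 4.0], [6.0, 8.0]]
    text = [0.0, 0.0]
    print(inject_text(frames, text))
```

To make the text "velocity-tailored", one option would be to call such a function separately on the features of each velocity scale with a scale-specific text embedding. How the paper derives those embeddings is not stated in this summary.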

🔎 Similar Papers

A Comprehensive Review of Few-shot Action Recognition