🤖 AI Summary
To address the increased intra-class variance introduced by long video sub-sequences and the high computational cost of mainstream Transformer-based approaches in few-shot action recognition (FSAR), this paper proposes Manta, a framework centered on a Matryoshka Mamba: a dual-module state-space architecture comprising Inner and Outer Modules. The Inner Modules enhance fine-grained feature representation via multi-granularity local modeling, while the Outer Module performs implicit temporal alignment by capturing dependencies between these local features. In a parallel branch, a hybrid supervised-unsupervised contrastive learning paradigm (SupCon + SimCLR) explicitly suppresses the accumulation of intra-class variance. Evaluated on SSv2, Kinetics, UCF101, and HMDB51, Manta achieves new state-of-the-art performance in FSAR, with notable gains in accuracy and robustness under long-sub-sequence settings, marking the first successful adaptation of the Mamba architecture to efficient, high-accuracy few-shot action understanding.
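As a concrete reading of the hybrid objective named above, the sketch below combines a SupCon-style supervised term (positives are all samples sharing a class label) with a SimCLR-style NT-Xent term (the only positive is the paired augmented view). The two-views-per-clip setup, the weighting factor `lam`, and the temperature `tau` are illustrative assumptions, not details taken from the paper:

```python
# Minimal sketch of a hybrid SupCon + SimCLR contrastive objective.
# Assumptions (not from the paper): two augmented views per clip,
# a simple weighted sum of the two terms, shared temperature `tau`.
import torch
import torch.nn.functional as F

def supcon_loss(z: torch.Tensor, labels: torch.Tensor, tau: float = 0.07) -> torch.Tensor:
    """SupCon-style term: every sample with the same label is a positive."""
    z = F.normalize(z, dim=1)
    sim = z @ z.T / tau                               # (N, N) scaled cosine similarities
    n = z.size(0)
    self_mask = torch.eye(n, dtype=torch.bool, device=z.device)
    pos_mask = (labels[:, None] == labels[None, :]) & ~self_mask
    sim = sim.masked_fill(self_mask, float("-inf"))   # exclude self-pairs
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    # average log-likelihood over positives; samples with no positive contribute zero
    pos_counts = pos_mask.sum(1).clamp(min=1)
    return -(log_prob.masked_fill(~pos_mask, 0.0).sum(1) / pos_counts).mean()

def simclr_loss(z1: torch.Tensor, z2: torch.Tensor, tau: float = 0.07) -> torch.Tensor:
    """NT-Xent term: the only positive for each view is its paired augmentation."""
    n = z1.size(0)
    z = F.normalize(torch.cat([z1, z2]), dim=1)       # (2N, D)
    sim = z @ z.T / tau
    sim.fill_diagonal_(float("-inf"))                 # exclude self-pairs
    # view i (in the first half) is positive with view i + n, and vice versa
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)]).to(z.device)
    return F.cross_entropy(sim, targets)

def hybrid_loss(z1, z2, labels, lam: float = 0.5, tau: float = 0.07):
    """Hybrid objective: supervised term over both views + weighted unsupervised term."""
    z_all = torch.cat([z1, z2])
    y_all = torch.cat([labels, labels])
    return supcon_loss(z_all, y_all, tau) + lam * simclr_loss(z1, z2, tau)
```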
📝 Abstract
In few-shot action recognition (FSAR), long sub-sequences of video naturally express entire actions more effectively than short ones. However, the high computational complexity of mainstream Transformer-based methods limits their application. The recent Mamba architecture models long sequences efficiently, but applying it directly to FSAR overlooks the importance of local feature modeling and alignment. Moreover, long sub-sequences within the same class accumulate intra-class variance, which degrades FSAR performance. To address these challenges, we propose a Matryoshka MAmba and CoNtrasTive LeArning framework (Manta). First, the Matryoshka Mamba introduces multiple Inner Modules to enhance local feature representation, rather than directly modeling global features; an Outer Module then captures temporal dependencies between these local features for implicit temporal alignment. Second, a hybrid contrastive learning paradigm, combining supervised and unsupervised methods, is designed to mitigate the negative effects of intra-class variance accumulation. The Matryoshka Mamba and the hybrid contrastive learning paradigm operate as two parallel branches within Manta, adapting Mamba to FSAR over long sub-sequences. Manta achieves new state-of-the-art performance on prominent benchmarks, including SSv2, Kinetics, UCF101, and HMDB51, and extensive empirical studies show that it significantly improves long-sub-sequence FSAR from multiple perspectives.
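To make the Inner/Outer decomposition concrete, here is a minimal structural sketch under stated assumptions: each Inner Module processes non-overlapping local windows of frame features at one granularity and pools each window to a summary token, and the Outer Module runs over the concatenated summaries to capture their temporal dependencies. The `SeqBlock` class is a GRU stand-in for a real Mamba/selective-SSM layer, and the window sizes are hypothetical choices for illustration:

```python
# Structural sketch of the Inner/Outer module nesting described in the abstract.
# `SeqBlock` is a stand-in sequence model; the paper's architecture would use
# Mamba (selective state-space) blocks instead. Window sizes are illustrative.
import torch
import torch.nn as nn

class SeqBlock(nn.Module):
    """Stand-in sequence model (a real implementation would use a Mamba/SSM block)."""
    def __init__(self, dim: int):
        super().__init__()
        self.rnn = nn.GRU(dim, dim, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):                          # x: (B, T, D)
        out, _ = self.rnn(x)
        return self.norm(out + x)                  # residual + norm

class MatryoshkaSketch(nn.Module):
    def __init__(self, dim: int, window_sizes=(2, 4, 8)):
        super().__init__()
        # one Inner Module per local granularity (window size)
        self.inner = nn.ModuleList(SeqBlock(dim) for _ in window_sizes)
        self.window_sizes = window_sizes
        self.outer = SeqBlock(dim)                 # models dependencies across local summaries

    def forward(self, x):                          # x: (B, T, D) frame-level features
        B, T, D = x.shape
        locals_ = []
        for block, w in zip(self.inner, self.window_sizes):
            # split the sequence into non-overlapping windows of length w
            t = (T // w) * w
            windows = x[:, :t].reshape(B * (t // w), w, D)
            # model each window locally, then pool it to one summary token
            locals_.append(block(windows).mean(dim=1).reshape(B, t // w, D))
        # concatenate summaries from all granularities along time and let the
        # Outer Module capture their ordering (the implicit alignment step)
        summary = torch.cat(locals_, dim=1)        # (B, sum over w of T // w, D)
        return self.outer(summary)

feats = torch.randn(2, 8, 64)                      # 2 clips, 8 frames, 64-d features
print(MatryoshkaSketch(64)(feats).shape)           # torch.Size([2, 7, 64])
```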