🤖 AI Summary
Few-shot action recognition aims to classify video actions accurately using only a few labeled examples per class, addressing the high annotation cost and the complexity of temporal modeling in videos. This paper presents a systematic survey of the field and introduces the first taxonomy of few-shot methods designed specifically for video actions, organized as a dual-track framework spanning generative-based and meta-learning approaches. Within the meta-learning paradigm, the survey decouples three technical dimensions: video instance representation, class prototype learning, and generalized video alignment; together these cover video representation learning, prototype/matching networks, adversarial generative modeling, and cross-video temporal alignment. It also reviews the commonly used benchmarks—including UCF101, HMDB51, and Kinetics—and identifies generalizable temporal modeling and self-supervised pretraining as critical directions for future advancement.
📝 Abstract
Few-shot action recognition aims to address the high cost and impracticality of manually labeling complex and variable video data for action recognition. It requires accurately classifying human actions in videos using only a few labeled examples per class. Compared to few-shot learning on images, few-shot action recognition is more challenging due to the intrinsic complexity of video data. Numerous approaches have driven significant advancements in few-shot action recognition, underscoring the need for a comprehensive survey. Unlike early surveys that focus on few-shot image or text classification, we explicitly address the unique challenges of few-shot action recognition. In this survey, we provide a comprehensive review of recent methods and introduce a novel, systematic taxonomy of existing approaches, accompanied by detailed analysis. We categorize the methods into generative-based and meta-learning frameworks, and further elaborate on the methods within the meta-learning framework along three aspects: video instance representation, category prototype learning, and generalized video alignment. Additionally, the survey presents the commonly used benchmarks and discusses relevant advanced topics and promising future directions. We hope this survey can serve as a valuable resource for researchers, providing essential guidance to newcomers and fresh insights to seasoned researchers.