A Comprehensive Review of Few-shot Action Recognition

📅 2024-07-20
🏛️ arXiv.org
📈 Citations: 1
Influential: 0
🤖 AI Summary
Few-shot action recognition aims to accurately classify video actions using only a few labeled examples per class, addressing the high annotation cost and the complexity of temporal modeling in videos. This paper presents a systematic survey of the field and introduces the first taxonomy of few-shot methods designed specifically for video actions, organized as a dual-track framework spanning generative modeling and meta-learning. Within the meta-learning paradigm, the survey decouples three technical dimensions: video instance representation, class prototype learning, and generalized video alignment. Its unified treatment covers video representation learning, prototype/matching networks, adversarial generative modeling, and cross-video temporal alignment. Comprehensive evaluations on standard benchmarks—including UCF101, HMDB51, and Kinetics—suggest that generalizable temporal modeling and self-supervised pretraining are critical directions for future advancement.

📝 Abstract
Few-shot action recognition aims to address the high cost and impracticality of manually labeling complex and variable video data for action recognition. It requires accurately classifying human actions in videos using only a few labeled examples per class. Compared to few-shot learning on images, few-shot action recognition is more challenging due to the intrinsic complexity of video data. Numerous approaches have driven significant advancements in few-shot action recognition, underscoring the need for a comprehensive survey. Unlike earlier surveys that focus on few-shot image or text classification, we explicitly consider the unique challenges of few-shot action recognition. In this survey, we provide a comprehensive review of recent methods and introduce a novel, systematic taxonomy of existing approaches, accompanied by a detailed analysis. We categorize the methods into generative-based and meta-learning frameworks, and further elaborate on the methods within the meta-learning framework along three aspects: video instance representation, category prototype learning, and generalized video alignment. Additionally, the survey presents the commonly used benchmarks and discusses relevant advanced topics and promising future directions. We hope this survey serves as a valuable resource for researchers, offering essential guidance to newcomers and fresh insights to seasoned researchers.
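The "category prototype learning" track the abstract mentions is commonly instantiated as prototypical-network-style classification: average the support embeddings of each class into a prototype, then assign a query to its nearest prototype. A minimal sketch, assuming videos have already been embedded as fixed-length feature vectors (the embedding model, labels, and toy numbers below are illustrative, not from the paper):

```python
# Illustrative sketch (not the survey's specific method): prototype-based
# few-shot classification over pre-computed video embeddings.

def prototype(support_embeddings):
    """Class prototype = element-wise mean of that class's support embeddings."""
    dim = len(support_embeddings[0])
    n = len(support_embeddings)
    return [sum(vec[i] for vec in support_embeddings) / n for i in range(dim)]

def squared_distance(a, b):
    """Squared Euclidean distance between two embeddings."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def classify(query, support_set):
    """support_set maps class label -> list of K support embeddings (N-way, K-shot).
    Returns the label of the prototype nearest to the query embedding."""
    prototypes = {label: prototype(vecs) for label, vecs in support_set.items()}
    return min(prototypes, key=lambda label: squared_distance(query, prototypes[label]))

# Toy 2-way, 2-shot episode with 2-D embeddings (hypothetical values):
support = {
    "jump": [[1.0, 0.0], [0.9, 0.1]],
    "run":  [[0.0, 1.0], [0.1, 0.9]],
}
print(classify([0.8, 0.2], support))  # -> jump
```

In a real few-shot pipeline the embeddings would come from a learned video backbone and the episode would be sampled per meta-training iteration; the nearest-prototype rule itself stays this simple.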
Problem

Research questions and friction points this paper is trying to address.

Reducing manual labeling costs for video action recognition
Classifying human actions with minimal labeled examples
Addressing video complexity in few-shot learning scenarios
Innovation

Methods, ideas, or system contributions that make the work stand out.

Generative-based and meta-learning frameworks
Video instance representation techniques
Category prototype learning methods
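The third meta-learning dimension the survey covers, generalized video alignment, deals with comparing videos whose actions unfold at different speeds. One classic way to realize this (a sketch of the general idea, not the paper's specific algorithm) is dynamic time warping over per-frame features:

```python
# Hedged sketch: dynamic time warping (DTW) between two frame-feature
# sequences, illustrating temporal alignment before query/support comparison.

def frame_cost(f, g):
    """Per-frame distance; here squared Euclidean over feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(f, g))

def dtw_distance(seq_a, seq_b):
    """Minimal cumulative cost of monotonically aligning seq_a to seq_b."""
    n, m = len(seq_a), len(seq_b)
    inf = float("inf")
    # acc[i][j] = best cost of aligning the first i frames of a to first j of b.
    acc = [[inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = frame_cost(seq_a[i - 1], seq_b[j - 1])
            acc[i][j] = cost + min(acc[i - 1][j],      # skip a frame of b
                                   acc[i][j - 1],      # skip a frame of a
                                   acc[i - 1][j - 1])  # match both frames
    return acc[n][m]

# The same action sampled at different speeds aligns with zero cost:
fast = [[0.0], [1.0], [2.0]]
slow = [[0.0], [0.0], [1.0], [1.0], [2.0]]
print(dtw_distance(fast, slow))  # -> 0.0
```

Methods in this family typically make the alignment differentiable (e.g., soft-minimum relaxations) so it can be trained end-to-end with the embedding network.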
Yuyang Wanyan
State Key Laboratory of Multimodal Artificial Intelligence Systems, the Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China, and also with the School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing 100049, China
Xiaoshan Yang
State Key Laboratory of Multimodal Artificial Intelligence Systems, the Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China, and also with the School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing 100049, China, and also with the Peng Cheng Laboratory, Shenzhen 518066, China
Weiming Dong
State Key Laboratory of Multimodal Artificial Intelligence Systems, the Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China, and also with the School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing 100049, China
Changsheng Xu
Professor, Institute of Automation, Chinese Academy of Sciences
Multimedia · Computer vision