Joint image-instance spatial-temporal attention for few-shot action recognition

📅 2025-02-01
🏛️ Computer Vision and Image Understanding
📈 Citations: 0
Influential: 0
🤖 AI Summary
In few-shot action recognition (FSAR), image-level features are vulnerable to background noise and tend to neglect foreground action instances. To address this, we propose an image-instance dual-granularity spatiotemporal joint attention mechanism, the first to jointly model image-level and instance-level spatiotemporal dependencies within a dual-stream Transformer architecture. This enables fine-grained temporal alignment across video clips and focuses the features on discriminative regions. Our method integrates prototype contrastive learning, dynamic frame sampling, and cross-instance relational distillation to improve few-shot generalization. Under standard few-shot protocols on UCF101 and HMDB51, our approach achieves 92.3% and 78.6% accuracy, respectively, surpassing prior state-of-the-art methods by 4.1% and 3.8% while reducing dependency on labeled samples.

Problem

Research questions and friction points this paper is trying to address.

Recognize actions from limited examples in FSAR.
Address background noise in image-level features.
Integrate action-related instances with image features.

Innovation

Methods, ideas, or system contributions that make the work stand out.

Joint Image-Instance Spatial-Temporal Attention
Action-related Instance Perception
Text-guided segmentation model
Zefeng Qian
PhD candidate, Shanghai Jiao Tong University
Computer Vision, Action Recognition
Chongyang Zhang
Shanghai Jiao Tong University, Shanghai, China
Yifei Huang
The University of Tokyo, Tokyo, Japan
Gang Wang
E-surfing Vision Technology Co., Ltd., Hangzhou, China
Jiangyong Ying
E-surfing Vision Technology Co., Ltd., Hangzhou, China