Collaboratively Self-supervised Video Representation Learning for Action Recognition

📅 2024-01-15
🏛️ IEEE Transactions on Information Forensics and Security
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the limited representation capacity in unsupervised video action recognition, this paper proposes a collaborative self-supervised video representation learning framework. Methodologically, it introduces a three-branch joint architecture that simultaneously models human pose generation (a generative pretext task) and contextual discrimination (a contrastive pretext task). An end-to-end video generation branch, driven by a Conditional GAN, jointly optimizes dynamic motion features and static scene features; additionally, an I-frame feature matching mechanism is incorporated to enhance spatiotemporal consistency. This work pioneers a dual-path “generation + discrimination” collaborative pretraining paradigm, overcoming representational limitations inherent in single-pretext-task approaches. The framework achieves state-of-the-art performance on mainstream benchmarks including UCF101 and HMDB51, and significantly improves zero-shot and few-shot action recognition accuracy.
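The collaborative pretraining the summary describes amounts to jointly minimizing the three branch losses: the Conditional-GAN pose prediction loss, the contrastive context matching loss, and the video generation loss. A minimal sketch of such a weighted joint objective (the function name and the branch weights are illustrative assumptions, not values from the paper):

```python
def combined_loss(pose_gan_loss, context_contrast_loss, video_gen_loss,
                  w_pose=1.0, w_ctx=1.0, w_gen=1.0):
    """Weighted sum of the three branch losses.

    The weights are hypothetical hyperparameters standing in for
    whatever balancing scheme the paper actually uses.
    """
    return (w_pose * pose_gan_loss
            + w_ctx * context_contrast_loss
            + w_gen * video_gen_loss)

# Toy per-batch loss values for each branch, combined into one objective.
total = combined_loss(0.8, 0.5, 1.2, w_pose=1.0, w_ctx=0.5, w_gen=0.25)
```

Because all three terms are optimized end-to-end, gradients from the generation branch flow back into both the motion and context encoders, which is what lets the generative and discriminative pretext tasks improve each other.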

📝 Abstract
Considering the close connection between action recognition and human pose estimation, we design a Collaboratively Self-supervised Video Representation (CSVR) learning framework specific to action recognition by jointly factoring in generative pose prediction and discriminative context matching as pretext tasks. Specifically, our CSVR consists of three branches: a generative pose prediction branch, a discriminative context matching branch, and a video generation branch. Among them, the first one encodes dynamic motion features by utilizing a Conditional-GAN to predict the human poses of future frames, and the second branch extracts static context features by contrasting positive and negative pairs of video features and I-frame features. The third branch is designed to generate both current and future video frames, for the purpose of collaboratively improving dynamic motion features and static context features. Extensive experiments demonstrate that our method achieves state-of-the-art performance on multiple popular video datasets.
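The context matching branch the abstract describes is a standard contrastive setup: the video feature and the I-frame feature from the same clip form a positive pair, while features from other clips in the batch serve as negatives. A minimal NumPy sketch of an InfoNCE-style loss for such pairs (the temperature value, feature shapes, and function name are illustrative assumptions, not the paper's settings):

```python
import numpy as np

def info_nce(video_feats, iframe_feats, temperature=0.1):
    """InfoNCE-style contrastive loss.

    video_feats, iframe_feats: (N, D) arrays where row i of each
    comes from the same clip i, so the diagonal of the similarity
    matrix holds the positive pairs and everything else is a negative.
    """
    # L2-normalize so dot products become cosine similarities.
    v = video_feats / np.linalg.norm(video_feats, axis=1, keepdims=True)
    f = iframe_feats / np.linalg.norm(iframe_feats, axis=1, keepdims=True)
    logits = v @ f.T / temperature                    # (N, N) similarity logits
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Positives sit on the diagonal; minimize their negative log-likelihood.
    return float(-np.mean(np.diag(log_prob)))
```

When the two feature sets are well aligned the diagonal dominates and the loss approaches zero; with unrelated features it sits near ln(N), which is the usual sanity check for a contrastive branch.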
Problem

Research questions and friction points this paper is trying to address.

Action Recognition
Video Analysis
Machine Learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

CSVR
Dual-task Learning
Self-supervised Action Recognition
👥 Authors
Jie Zhang
Institute of Computing Technology, Chinese Academy of Sciences; University of Chinese Academy of Sciences
Zhifan Wan
Institute of Computing Technology, Chinese Academy of Sciences; University of Chinese Academy of Sciences
Lanqing Hu
Institute of Computing Technology, Chinese Academy of Sciences; University of Chinese Academy of Sciences
Stephen Lin
Microsoft Research Asia
Computer Vision
Shuzhe Wu
Institute of Computing Technology, Chinese Academy of Sciences
Computer Vision · Machine Learning
Shiguang Shan
Professor, Institute of Computing Technology, Chinese Academy of Sciences
Computer Vision · Pattern Recognition · Machine Learning · Face Recognition