Unsupervised Video Domain Adaptation with Masked Pre-Training and Collaborative Self-Training

📅 2023-12-05
🏛️ Computer Vision and Pattern Recognition
📈 Citations: 2
Influential: 0
🤖 AI Summary
This paper addresses unsupervised domain adaptation (UDA) for video action recognition, aiming to enhance cross-domain generalization without target-domain labels. The proposed UNITE framework introduces two key innovations: (1) teacher-guided masked distillation pretraining, where an image-based teacher model supervises a video student model to learn robust spatiotemporal representations; and (2) a teacher-student collaborative self-training mechanism integrating dynamic pseudo-label optimization and confidence-weighted consistency regularization, significantly improving pseudo-label quality and adaptation stability. UNITE synergistically combines self-supervised masked modeling, cross-modal knowledge distillation, and iterative self-training. Evaluated on standard video UDA benchmarks—including UCF-HMDB and Something-Something—UNITE consistently outperforms state-of-the-art methods, achieving average accuracy gains of 4.2–7.8 percentage points.
📝 Abstract
In this work, we tackle the problem of unsupervised domain adaptation (UDA) for video action recognition. Our approach, which we call UNITE, uses an image teacher model to adapt a video student model to the target domain. UNITE first employs self-supervised pretraining to promote discriminative feature learning on target domain videos using a teacher-guided masked distillation objective. We then perform self-training on masked target data, using the video student model and image teacher model together to generate improved pseudolabels for unlabeled target videos. Our self-training process successfully leverages the strengths of both models to achieve strong transfer performance across domains. We evaluate our approach on multiple video domain adaptation benchmarks and observe significant improvements upon previously reported results.
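The teacher-guided masked distillation objective described above can be sketched in a few lines. This is a minimal NumPy illustration under stated assumptions, not the paper's implementation: the function name `masked_distillation_loss`, the ~75% mask ratio, and the squared-error distance are all hypothetical choices for exposition.

```python
import numpy as np

def masked_distillation_loss(student_pred, teacher_feat, mask):
    """Mean squared error between video-student predictions and frozen
    image-teacher features, computed only at masked token positions.
    (Assumed form; the paper may use a different distance.)"""
    per_token = ((student_pred - teacher_feat) ** 2).mean(axis=-1)
    return float((per_token * mask).sum() / mask.sum())

rng = np.random.default_rng(0)
T, D = 16, 8                        # toy setup: 16 patch tokens, 8-dim features
teacher = rng.normal(size=(T, D))   # image-teacher features (distillation target)
student = teacher + 0.1 * rng.normal(size=(T, D))  # student's reconstructions
mask = (rng.random(T) < 0.75).astype(float)        # ~75% of tokens masked out

loss = masked_distillation_loss(student, teacher, mask)
```

In this setup the student only sees the unmasked tokens and must predict the teacher's features at the masked ones, which is what pushes it toward discriminative target-domain representations.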
Problem

Research questions and friction points this paper is trying to address.

Unsupervised domain adaptation for video action recognition
Adapting video models using image teacher models
Improving pseudolabels for unlabeled target videos
Innovation

Methods, ideas, or system contributions that make the work stand out.

Self-supervised pre-training with masked distillation
Collaborative self-training using image and video models
Improved pseudolabel generation for unlabeled target videos
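The collaborative self-training idea above can be sketched as follows: fuse the video student's and image teacher's class probabilities, then keep only confident predictions as pseudolabels. The product-of-probabilities fusion rule and the 0.8 confidence threshold are illustrative assumptions, not the paper's exact mechanism.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def collaborative_pseudolabels(student_logits, teacher_logits, threshold=0.8):
    """Fuse video-student and image-teacher predictions (here via a product
    of probabilities -- an assumed rule) and keep only confident labels."""
    fused = softmax(student_logits) * softmax(teacher_logits)
    fused /= fused.sum(axis=-1, keepdims=True)
    labels = fused.argmax(axis=-1)          # candidate pseudolabel per video
    keep = fused.max(axis=-1) >= threshold  # confidence filter
    return labels, keep

# toy batch: 3 unlabeled target videos, 3 action classes
student = np.array([[4.0, 0.0, 0.0],   # both models confident, agree
                    [0.1, 0.2, 0.0],   # both uncertain -> filtered out
                    [0.0, 3.0, 0.0]])  # confident agreement on class 1
teacher = np.array([[3.0, 0.0, 0.0],
                    [0.0, 0.1, 0.2],
                    [0.0, 2.5, 0.0]])
labels, keep = collaborative_pseudolabels(student, teacher)
```

The filter discards videos where the two models disagree or are individually unsure, which is one simple way to realize the "improved pseudolabels" the abstract claims from using both models together.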