Transfer-learning for video classification: Video Swin Transformer on multiple domains

📅 2022-10-18
🏛️ arXiv.org
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work evaluates how well Video Swin Transformer (VST) transfers to out-of-domain video classification without retraining the whole model. Method: A backbone pretrained on Kinetics-400 is kept frozen and only a lightweight head is adapted, with transfer experiments targeting FCVID and Something-Something. Contribution/Results: Transfer performance depends strongly on whether the source and target classes are of the same type, that is, object-centric versus action-centric. Accuracy also decreases on average as video duration increases, which seems to be a consequence of a design choice of the model. The transfer-learning approach requires around 4x less memory than training from scratch. It reaches 85% top-1 accuracy on FCVID, equal to the state of the art for that dataset, and 21% on Something-Something, indicating viable transfer when class types match and poor accuracy when they do not.
📝 Abstract
The computer vision community has seen a shift from convolution-based to pure transformer architectures for both image and video tasks. Training a transformer from scratch for these tasks usually requires a lot of data and computational resources. Video Swin Transformer (VST) is a pure-transformer model developed for video classification which achieves state-of-the-art results in accuracy and efficiency on several datasets. In this paper, we aim to understand if VST generalizes well enough to be used in an out-of-domain setting. We study the performance of VST on two large-scale datasets, namely FCVID and Something-Something, using a transfer-learning approach from Kinetics-400, which requires around 4x less memory than training from scratch. We then break down the results to understand where VST fails the most and in which scenarios the transfer-learning approach is viable. Our experiments show an 85% top-1 accuracy on FCVID without retraining the whole model, which is equal to the state of the art for the dataset, and a 21% accuracy on Something-Something. The experiments also suggest that the performance of VST decreases on average when the video duration increases, which seems to be a consequence of a design choice of the model. From the results, we conclude that VST generalizes well enough to classify out-of-domain videos without retraining when the target classes are of the same type as the classes used to train the model. We observed this effect when we performed transfer learning from Kinetics-400 to FCVID, where both datasets target mostly objects. On the other hand, if the classes are not of the same type, then the accuracy after the transfer-learning approach is expected to be poor. We observed this effect when we performed transfer learning from Kinetics-400, where the classes represent mostly objects, to Something-Something, where the classes represent mostly actions.
Problem

Research questions and friction points this paper is trying to address.

Evaluating Video Swin Transformer's generalization in out-of-domain video classification
Assessing transfer-learning performance from Kinetics-400 to FCVID and Something-Something
Analyzing VST's accuracy drop with increasing video duration
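
The duration analysis mentioned above can be illustrated with a small breakdown of top-1 accuracy by video length. This is only a sketch: the function name `accuracy_by_duration`, the bin edges, and the sample records are hypothetical and are not taken from the paper.

```python
from collections import defaultdict


def accuracy_by_duration(records, bin_edges):
    """Top-1 accuracy per duration bin.

    records: iterable of (duration_seconds, true_label, predicted_label)
    bin_edges: ascending upper bounds in seconds; longer videos fall
               into one final open-ended bin
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for duration, truth, predicted in records:
        # Index of the first bin whose upper bound exceeds the duration.
        bin_index = next(
            (i for i, edge in enumerate(bin_edges) if duration < edge),
            len(bin_edges),
        )
        totals[bin_index] += 1
        hits[bin_index] += int(truth == predicted)
    return {i: hits[i] / totals[i] for i in sorted(totals)}


# Made-up predictions for illustration only (not results from the paper).
records = [
    (12.0, "dog", "dog"),
    (25.0, "cat", "cat"),
    (70.0, "dog", "cat"),
    (95.0, "car", "car"),
    (300.0, "cat", "dog"),
]
per_bin = accuracy_by_duration(records, bin_edges=[60.0, 120.0])
print(per_bin)  # {0: 1.0, 1: 0.5, 2: 0.0}
```

Bin 0 holds videos under 60 s, bin 1 those from 60 s to under 120 s, and bin 2 everything longer. A downward trend across bins would match the kind of degradation the abstract describes.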
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses Video Swin Transformer for video classification
Applies transfer learning to reduce memory usage
Analyzes performance across different video domains
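
One common way to transfer a pretrained video model without retraining all of it is to freeze the backbone and train only a new classification head. The PyTorch sketch below assumes that setup; the page does not spell out the exact layers that were retrained, so this should not be read as the authors' implementation. The class `FrozenBackboneClassifier`, the toy backbone standing in for a Kinetics-400-pretrained VST, and all sizes are hypothetical.

```python
import torch
from torch import nn


class FrozenBackboneClassifier(nn.Module):
    """Frozen feature extractor plus a trainable linear head."""

    def __init__(self, backbone: nn.Module, feature_dim: int, num_classes: int):
        super().__init__()
        self.backbone = backbone
        for param in self.backbone.parameters():
            param.requires_grad = False  # pretrained weights stay fixed
        self.head = nn.Linear(feature_dim, num_classes)  # only trainable part

    def train(self, mode: bool = True):
        super().train(mode)
        self.backbone.eval()  # keep the backbone in inference mode
        return self

    def forward(self, clips: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():  # no backbone activations are kept for backprop
            features = self.backbone(clips)
        return self.head(features)


# Hypothetical stand-in for a pretrained VST: (B, 3, T, H, W) -> (B, 16).
toy_backbone = nn.Sequential(
    nn.Conv3d(3, 16, kernel_size=3, padding=1),
    nn.AdaptiveAvgPool3d(1),
    nn.Flatten(),
)

model = FrozenBackboneClassifier(toy_backbone, feature_dim=16, num_classes=5)
optimizer = torch.optim.SGD(model.head.parameters(), lr=0.1)

clips = torch.randn(2, 3, 4, 8, 8)  # two random "videos"
labels = torch.tensor([0, 3])

model.train()
loss = nn.functional.cross_entropy(model(clips), labels)
loss.backward()   # gradients reach the head only
optimizer.step()
```

Because no gradients or intermediate activations are stored for the frozen part, a setup like this typically needs much less memory than updating every weight, which is in line with the memory saving reported in the abstract.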