🤖 AI Summary
This work evaluates how well Video Swin Transformer (VST) transfers to out-of-domain video classification, focusing on the limits of its generalization when the full model is not retrained.
Method: We adopt a frozen backbone with lightweight head adaptation and run transfer experiments from Kinetics-400 to two target datasets, FCVID and Something-Something v2.
Contribution/Results: We show that VST's cross-domain performance depends strongly on whether the source and target categories are of the same type, that is, object-centric versus action-centric. Accuracy decreases on average as video duration increases, which appears to be a consequence of a design choice in the model. The transfer approach needs roughly a quarter of the memory required to train from scratch. It reaches 85% top-1 accuracy on FCVID, on par with the state of the art for that dataset, and 21% on Something-Something v2, indicating effective transfer between semantically aligned domains but substantial degradation under semantic mismatch.
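The frozen-backbone setup described in the Method line can be sketched in PyTorch as follows. This is a minimal illustration under stated assumptions, not the authors' code: `make_toy_backbone` is a hypothetical stand-in for a pretrained VST, and the feature size, class count, and clip shape are arbitrary values chosen for the example.

```python
import torch
from torch import nn


def make_toy_backbone(feature_dim: int = 32) -> nn.Module:
    # Hypothetical stand-in for a pretrained video backbone (NOT VST itself).
    # Maps clips of shape (batch, 3, frames, height, width) to (batch, feature_dim).
    return nn.Sequential(
        nn.Conv3d(3, feature_dim, kernel_size=3, padding=1),
        nn.ReLU(),
        nn.AdaptiveAvgPool3d(1),
        nn.Flatten(),
    )


class FrozenBackboneClassifier(nn.Module):
    """Frozen feature extractor followed by a trainable linear head."""

    def __init__(self, backbone: nn.Module, feature_dim: int, num_classes: int):
        super().__init__()
        self.backbone = backbone
        # Freeze the backbone: its weights receive no gradients and no updates.
        for param in self.backbone.parameters():
            param.requires_grad_(False)
        self.head = nn.Linear(feature_dim, num_classes)

    def forward(self, clips: torch.Tensor) -> torch.Tensor:
        # no_grad avoids storing backbone activations for the backward pass,
        # which is where most of the memory saving comes from.
        with torch.no_grad():
            features = self.backbone(clips)
        return self.head(features)


# Illustrative values only (assumptions, not taken from the paper).
model = FrozenBackboneClassifier(make_toy_backbone(32), feature_dim=32, num_classes=10)
optimizer = torch.optim.SGD(model.head.parameters(), lr=0.01)  # head parameters only

clips = torch.randn(2, 3, 8, 32, 32)  # two random 8-frame clips
labels = torch.tensor([0, 5])

logits = model(clips)
loss = nn.functional.cross_entropy(logits, labels)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

Only the head is passed to the optimizer, so neither gradients nor optimizer state are kept for the backbone. That is consistent with the lower memory footprint reported above, although the exact ratio depends on the model and batch size.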
📝 Abstract
The computer vision community has seen a shift from convolution-based to pure-transformer architectures for both image and video tasks. Training a transformer from scratch for these tasks usually requires large amounts of data and computational resources. Video Swin Transformer (VST) is a pure-transformer model developed for video classification that achieves state-of-the-art accuracy and efficiency on several datasets. In this paper, we aim to understand whether VST generalizes well enough to be used in an out-of-domain setting. We study the performance of VST on two large-scale datasets, FCVID and Something-Something, using transfer learning from Kinetics-400, an approach that requires around 4x less memory than training from scratch. We then break down the results to understand where VST fails the most and in which scenarios the transfer-learning approach is viable. Our experiments show 85% top-1 accuracy on FCVID without retraining the whole model, which is on par with the state of the art for that dataset, and 21% accuracy on Something-Something. The experiments also suggest that the performance of VST decreases on average as video duration increases, which seems to be a consequence of a design choice in the model. From these results, we conclude that VST generalizes well enough to classify out-of-domain videos without retraining when the target classes are of the same type as the classes used to train the model. We observed this effect when transferring from Kinetics-400 to FCVID, where both datasets target mostly objects. If, on the other hand, the classes are not of the same type, accuracy after transfer learning is expected to be poor. We observed this effect when transferring from Kinetics-400, where the classes represent mostly objects, to Something-Something, where the classes represent mostly actions.
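The breakdown of accuracy by video duration mentioned in the abstract could be computed along the following lines. This is a hedged sketch only: the function name `accuracy_by_duration`, the bucket edges, and the sample records are all made up for illustration and do not reproduce the paper's data or analysis code.

```python
from bisect import bisect_right
from collections import defaultdict


def accuracy_by_duration(records, bucket_edges):
    """Group predictions into duration buckets and return top-1 accuracy per bucket.

    records: iterable of (duration_seconds, true_label, predicted_label).
    bucket_edges: sorted list of edges in seconds; bucket i covers
        [bucket_edges[i - 1], bucket_edges[i]), with open-ended first and last buckets.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for duration, true_label, predicted_label in records:
        bucket = bisect_right(bucket_edges, duration)
        totals[bucket] += 1
        if true_label == predicted_label:
            hits[bucket] += 1
    return {bucket: hits[bucket] / totals[bucket] for bucket in sorted(totals)}


# Made-up records, for illustration only.
sample_records = [
    (10, "a", "a"),
    (20, "b", "b"),
    (45, "a", "b"),
    (50, "c", "c"),
    (200, "a", "b"),
]
per_bucket = accuracy_by_duration(sample_records, bucket_edges=[30, 120])
print(per_bucket)  # {0: 1.0, 1: 0.5, 2: 0.0}
```

Bucket 0 holds videos shorter than 30 s, bucket 1 those from 30 s up to 120 s, and bucket 2 everything longer. A downward trend across buckets would correspond to the duration effect the abstract describes.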