🤖 AI Summary
This work systematically evaluates the generalization of video self-supervised learning models under realistic conditions, focusing on four downstream factors: domain shift, sample efficiency, action granularity, and task diversity. Using a unified pretraining-finetuning protocol, we conduct over 1,100 experiments across 8 datasets and 7 task categories, benchmarking 12 transformer-based and 10 CNN-based architectures; this is the first such sensitivity analysis extended to video-specific and video-text transformers. Key findings: architectural advances do not improve generalization robustness; video-only transformers are more resilient to domain shift; CNNs excel at fine-grained action recognition; and video-text models show a decoupling between pretraining scale and downstream performance. We introduce VidGenBench, the first comprehensive, generalization-oriented evaluation benchmark for video representation learning, along with an open, reproducible experimental library. Our analysis reveals significant benchmark sensitivity across all models, with no single method consistently dominating along every dimension.
📝 Abstract
Continued advances in self-supervised learning have led to significant progress in video representation learning, offering a scalable alternative to supervised approaches by removing the need for manual annotations. Despite strong performance on standard action recognition benchmarks, video self-supervised learning methods are largely evaluated under narrow protocols, typically pretraining on Kinetics-400 and fine-tuning on similar datasets, which limits our understanding of their generalization to real-world scenarios. In this work, we present a comprehensive evaluation of modern video self-supervised models, focusing on generalization across four key downstream factors: domain shift, sample efficiency, action granularity, and task diversity. Building on our prior work analyzing benchmark sensitivity in CNN-based contrastive learning, we extend the study to cover state-of-the-art transformer-based video-only and video-text models. Specifically, we benchmark 12 transformer-based methods (7 video-only, 5 video-text) and compare them to 10 CNN-based methods, totaling over 1,100 experiments across 8 datasets and 7 downstream tasks. Our analysis shows that, despite architectural advances, transformer-based models remain sensitive to downstream conditions: no method generalizes consistently across all factors, video-only transformers perform better under domain shift, CNNs outperform on fine-grained tasks, and video-text models often underperform despite large-scale pretraining. We also find that recent transformer models do not consistently outperform earlier approaches. Our findings provide a detailed view of the strengths and limitations of current video SSL methods and offer a unified benchmark for evaluating generalization in video representation learning.
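To make the unified pretraining-finetuning protocol concrete, below is a minimal, hypothetical sketch of the evaluation loop: every pretrained backbone is finetuned and evaluated with an identical recipe on downstream settings that each probe one of the four generalization factors. All names in the sketch (`MODELS`, `DOWNSTREAM`, `finetune_and_evaluate`) are illustrative placeholders under assumed settings, not the actual API of the released experimental library.

```python
# Hypothetical sketch of the unified pretraining-finetuning evaluation loop.
# Model names, datasets, and helper functions are illustrative placeholders,
# not the paper's released evaluation code.
from itertools import product

MODELS = ["video_only_transformer", "video_text_transformer", "cnn_contrastive"]

# Each downstream setting probes one of the four generalization factors.
DOWNSTREAM = {
    "domain_shift":      ("out_of_domain_dataset", "action_recognition"),
    "sample_efficiency": ("ucf101_low_shot",       "action_recognition"),
    "granularity":       ("fine_grained_dataset",  "fine_grained_recognition"),
    "task_diversity":    ("multi_label_dataset",   "temporal_localization"),
}

def finetune_and_evaluate(model: str, dataset: str, task: str) -> float:
    """Placeholder: finetune the pretrained backbone on the downstream
    dataset/task and return an evaluation score (e.g. top-1 accuracy)."""
    return 0.0  # stub value; a real run would train and test here

def run_benchmark() -> dict:
    results = {}
    # Identical finetuning recipe for every (model, factor) pair, so score
    # differences reflect the pretrained representation, not the protocol.
    for model, (factor, (dataset, task)) in product(MODELS, DOWNSTREAM.items()):
        results[(model, factor)] = finetune_and_evaluate(model, dataset, task)
    return results

if __name__ == "__main__":
    for (model, factor), score in run_benchmark().items():
        print(f"{model:>24} | {factor:<18} | {score:.1f}")
```

In this framing, generalization is read off by comparing each model's scores across the factor columns rather than on a single in-domain benchmark.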