🤖 AI Summary
This work systematically evaluates the generalization of video self-supervised learning models under realistic conditions, focusing on four downstream factors: domain shift, sample efficiency, action granularity, and task diversity. Using a unified pretraining-finetuning protocol, we conduct over 1,100 experiments across 8 datasets and 7 task categories, benchmarking 12 transformer-based and 10 CNN-based architectures; this is the first such sensitivity analysis extended to video-specific and video-text transformers. Key findings: architectural advances do not improve generalization robustness; video-only transformers are more resilient to domain shift; CNNs excel at fine-grained action recognition; and video-text models show a decoupling between pretraining scale and downstream performance. We introduce VidGenBench, the first comprehensive, generalization-oriented evaluation benchmark for video representation learning, along with an open, reproducible experimental library. Our analysis reveals significant benchmark sensitivity across all models, with no single method consistently dominating along every dimension.
📝 Abstract
Continued advances in self-supervised learning have led to significant progress in video representation learning, offering a scalable alternative to supervised approaches by removing the need for manual annotations. Despite strong performance on standard action recognition benchmarks, video self-supervised learning methods are largely evaluated under narrow protocols, typically pretraining on Kinetics-400 and fine-tuning on similar datasets, which limits our understanding of their generalization to real-world scenarios. In this work, we present a comprehensive evaluation of modern video self-supervised models, focusing on generalization across four key downstream factors: domain shift, sample efficiency, action granularity, and task diversity. Building on our prior work analyzing benchmark sensitivity in CNN-based contrastive learning, we extend the study to cover state-of-the-art transformer-based video-only and video-text models. Specifically, we benchmark 12 transformer-based methods (7 video-only, 5 video-text) and compare them to 10 CNN-based methods, totaling over 1,100 experiments across 8 datasets and 7 downstream tasks. Our analysis shows that, despite architectural advances, transformer-based models remain sensitive to downstream conditions: no method generalizes consistently across all factors, video-only transformers perform better under domain shift, CNNs outperform on fine-grained tasks, and video-text models often underperform despite large-scale pretraining. We also find that recent transformer models do not consistently outperform earlier approaches. Our findings provide a detailed view of the strengths and limitations of current video SSL methods and offer a unified benchmark for evaluating generalization in video representation learning.
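To make the unified pretraining-finetuning protocol concrete, below is a minimal, hypothetical sketch of the evaluation loop: every pretrained backbone is finetuned and evaluated with an identical recipe on downstream settings that each probe one of the four generalization factors. All names in the sketch (`MODELS`, `DOWNSTREAM`, `finetune_and_evaluate`) are illustrative placeholders under assumed settings, not the actual API of the released experimental library.

```python
# Hypothetical sketch of the unified pretraining-finetuning evaluation loop.
# Model names, datasets, and helper functions are illustrative placeholders,
# not the paper's released evaluation code.
from itertools import product

MODELS = ["video_only_transformer", "video_text_transformer", "cnn_contrastive"]

# Each downstream setting probes one of the four generalization factors.
DOWNSTREAM = {
    "domain_shift":      ("out_of_domain_dataset", "action_recognition"),
    "sample_efficiency": ("ucf101_low_shot",       "action_recognition"),
    "granularity":       ("fine_grained_dataset",  "fine_grained_recognition"),
    "task_diversity":    ("multi_label_dataset",   "temporal_localization"),
}

def finetune_and_evaluate(model: str, dataset: str, task: str) -> float:
    """Placeholder: finetune the pretrained backbone on the downstream
    dataset/task and return an evaluation score (e.g. top-1 accuracy)."""
    return 0.0  # stub value; a real run would train and test here

def run_benchmark() -> dict:
    results = {}
    # Identical finetuning recipe for every (model, factor) pair, so score
    # differences reflect the pretrained representation, not the protocol.
    for model, (factor, (dataset, task)) in product(MODELS, DOWNSTREAM.items()):
        results[(model, factor)] = finetune_and_evaluate(model, dataset, task)
    return results

if __name__ == "__main__":
    for (model, factor), score in run_benchmark().items():
        print(f"{model:>24} | {factor:<18} | {score:.1f}")
```

In this framing, generalization is read off by comparing each model's scores across the factor columns rather than on a single in-domain benchmark.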