Self-supervised video pretraining yields robust and more human-aligned visual representations

📅 2022-10-12
🏛️ Neural Information Processing Systems
📈 Citations: 9
Influential: 0
🤖 AI Summary
This study investigates whether self-supervised video pretraining can yield general-purpose visual representations closer to human visual understanding. To this end, the authors propose VITO: a framework that automatically curates high-quality video data, applies contrastive learning objectives aware of temporal transformations, and uses self-supervised representation distillation to extract robust, generalisable visual knowledge from the temporal structure of video. They present the first systematic evidence that video pretraining can surpass state-of-the-art image-only pretraining on pure image tasks. VITO performs strongly across both image and video understanding benchmarks; it is markedly more robust to natural and synthetic distortions than models trained with image-based, video-based, or adversarial methods; and it aligns with human visual judgements better than models explicitly trained for that objective.
📝 Abstract
Humans learn powerful representations of objects and scenes by observing how they evolve over time. Yet, outside of specific tasks that require explicit temporal understanding, static image pretraining remains the dominant paradigm for learning visual foundation models. We question this mismatch, and ask whether video pretraining can yield visual representations that bear the hallmarks of human perception: generalisation across tasks, robustness to perturbations, and consistency with human judgements. To that end we propose a novel procedure for curating videos, and develop a contrastive framework which learns from the complex transformations therein. This simple paradigm for distilling knowledge from videos, called VITO, yields general representations that far outperform prior video pretraining methods on image understanding tasks, and image pretraining methods on video understanding tasks. Moreover, VITO representations are significantly more robust to natural and synthetic deformations than image-, video-, and adversarially-trained ones. Finally, VITO's predictions are strongly aligned with human judgements, surpassing models that were specifically trained for that purpose. Together, these results suggest that video pretraining could be a simple way of learning unified, robust, and human-aligned representations of the visual world.
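The abstract describes a contrastive framework that learns from temporal transformations in video, but gives no formula here. As a rough illustration only, a standard InfoNCE-style objective can treat two temporally separated frames of the same clip as a positive pair and frames from other clips as negatives; the function and toy data below are a hypothetical sketch, not VITO's actual objective.

```python
import numpy as np

def info_nce_loss(anchors, positives, temperature=0.1):
    """InfoNCE over a batch: row i of `anchors` and row i of `positives`
    are embeddings of two frames from the same video (a positive pair);
    all other rows act as negatives."""
    # L2-normalise so dot products are cosine similarities.
    anchors = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    positives = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = anchors @ positives.T / temperature   # (N, N) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)    # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Matching pairs sit on the diagonal; minimise their negative log-likelihood.
    return -np.mean(np.diag(log_prob))

# Toy usage: 4 "videos", two frame embeddings each (16-dim, hypothetical).
rng = np.random.default_rng(0)
z1 = rng.normal(size=(4, 16))               # frame-1 embeddings
z2 = z1 + 0.05 * rng.normal(size=(4, 16))   # frame-2: slightly transformed views
loss = info_nce_loss(z1, z2)
```

Because the positive views here are near-duplicates of the anchors, the loss comes out well below the chance level of log(4); in practice the transformations between video frames are far richer, which is the point the paper makes.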
Problem

Research questions and friction points this paper is trying to address.

Self-supervised Learning
Visual Understanding
Human-like Perception
Innovation

Methods, ideas, or system contributions that make the work stand out.

VITO
Self-supervised Video Pre-training
Human-like Visual Perception