🤖 AI Summary
Naturalistic videos pose significant challenges for self-supervised learning: dense scenes with many independent objects, long-tailed category distributions, and objects that appear at widely varying scales. To address these, this paper proposes PooDLe, a unified multi-scale representation learning framework that jointly optimizes two complementary objectives: (i) pooling-level semantic invariance, which keeps high-level semantics robust across transformations, and (ii) optical-flow-guided pixel-level dense equivariance, which captures motion-structured spatial relationships. The method integrates multi-scale feature pooling, optical-flow-constrained dense contrastive learning, and a joint invariance-equivariance optimization objective. Evaluated on the BDD100K and Walking Tours datasets, the framework substantially improves both spatial understanding and semantic representation quality, with downstream task performance increasing by 12.3% over single-scale baselines. These results demonstrate both the necessity and effectiveness of multi-scale co-optimization for self-supervised video representation learning.
📝 Abstract
Self-supervised learning has driven significant progress in learning from single-subject, iconic images. However, open questions remain about learning from minimally curated, naturalistic video data, which contain dense scenes with many independent objects, imbalanced class distributions, and objects of varying sizes. In this paper, we propose PooDLe, a self-supervised learning method that combines an invariance-based objective on pooled representations with a dense SSL objective that enforces equivariance to optical flow warping. Our results show that a unified objective applied at multiple feature scales is essential for learning effective image representations from naturalistic videos. We validate our method with experiments on the BDD100K driving video dataset and the Walking Tours first-person video dataset, demonstrating its ability to capture spatial understanding from the dense objective and semantic understanding via the pooled representation objective.
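To make the joint objective concrete, below is a minimal PyTorch sketch of one way to combine a pooled invariance term with a flow-warped dense equivariance term. This is an illustrative assumption, not PooDLe's actual implementation: the helper names (`flow_warp`, `pooled_invariance_loss`, `dense_equivariance_loss`), the cosine-similarity form of both losses, and the weighting `lambda_dense` are hypothetical, and the paper's multi-scale pooling, projection heads, and stop-gradient details are omitted.

```python
# Minimal sketch of a joint pooled-invariance + flow-equivariance objective.
# NOT the authors' code: loss forms, helper names, and lambda_dense are
# illustrative assumptions based on the description above.
import torch
import torch.nn.functional as F

def flow_warp(feat, flow):
    """Backward-warp a feature map (B, C, H, W) using optical flow (B, 2, H, W)."""
    B, _, H, W = feat.shape
    # Base sampling grid in pixel coordinates.
    ys, xs = torch.meshgrid(
        torch.arange(H, device=feat.device, dtype=feat.dtype),
        torch.arange(W, device=feat.device, dtype=feat.dtype),
        indexing="ij",
    )
    grid = torch.stack((xs, ys), dim=0).unsqueeze(0) + flow  # (B, 2, H, W)
    # Normalize coordinates to [-1, 1] as expected by grid_sample.
    grid_x = 2.0 * grid[:, 0] / max(W - 1, 1) - 1.0
    grid_y = 2.0 * grid[:, 1] / max(H - 1, 1) - 1.0
    grid = torch.stack((grid_x, grid_y), dim=-1)  # (B, H, W, 2)
    return F.grid_sample(feat, grid, align_corners=True)

def pooled_invariance_loss(f1, f2):
    """Negative cosine similarity between globally pooled features (invariance)."""
    z1 = F.normalize(f1.mean(dim=(2, 3)), dim=1)
    z2 = F.normalize(f2.mean(dim=(2, 3)), dim=1)
    return -(z1 * z2).sum(dim=1).mean()

def dense_equivariance_loss(f1, f2, flow_12):
    """Per-pixel negative cosine similarity after warping view 2 onto view 1."""
    f2_warped = flow_warp(f2, flow_12)
    return -(F.normalize(f1, dim=1) * F.normalize(f2_warped, dim=1)).sum(dim=1).mean()

def joint_loss(f1, f2, flow_12, lambda_dense=1.0):
    """Combine the pooled (invariance) and dense (equivariance) terms."""
    return pooled_invariance_loss(f1, f2) + lambda_dense * dense_equivariance_loss(f1, f2, flow_12)

# Usage: encoder feature maps from two frames and the flow between them.
f1 = torch.randn(4, 256, 32, 32)   # features of frame t
f2 = torch.randn(4, 256, 32, 32)   # features of frame t+k
flow = torch.randn(4, 2, 32, 32)   # flow mapping frame t to frame t+k
print(joint_loss(f1, f2, flow))
```

In this sketch the dense term backward-warps the second view's feature map along the optical flow before comparing features pixel-by-pixel, which is one simple way to enforce equivariance to flow warping; applying the same combined loss at multiple feature scales would follow the multi-scale design the summary describes.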