Scaling 4D Representations

📅 2024-12-19
🏛️ arXiv.org
📈 Citations: 5
Influential: 0
🤖 AI Summary
Existing purely self-supervised video learning approaches lack systematic validation of scalability for non-semantic 4D vision tasks—such as camera pose estimation, point/object tracking, and depth estimation. This paper introduces a Transformer-based video model grounded in masked autoencoding (MAE), trained on large-scale video data across a rigorously controlled, multi-scale ablation framework. We scale model parameters from 20M to 22B—the largest purely self-supervised video models to date. For the first time, we empirically demonstrate strong scalability of such representations for 4D tasks: performance improves consistently with model size. Our approach significantly outperforms prior image- and video-based self-supervised methods on multiple non-semantic benchmarks, establishing new state-of-the-art results in camera pose estimation, tracking, and depth estimation.

📝 Abstract
Scaling has not yet been convincingly demonstrated for pure self-supervised learning from video. However, prior work has focused evaluations on semantic-related tasks – action classification, ImageNet classification, etc. In this paper we focus on evaluating self-supervised learning on non-semantic vision tasks that are more spatial (3D) and temporal (+1D = 4D), such as camera pose estimation, point and object tracking, and depth estimation. We show that by learning from very large video datasets, masked auto-encoding (MAE) with transformer video models actually scales, consistently improving performance on these 4D tasks, as model size increases from 20M all the way to the largest by far reported self-supervised video model – 22B parameters. Rigorous apples-to-apples comparison with many recent image and video models demonstrates the benefits of scaling 4D representations.
Problem

Research questions and friction points this paper is trying to address.

Evaluating self-supervised learning on non-semantic 4D vision tasks
Scaling masked auto-encoding with transformer video models
Improving performance on spatial-temporal tasks like pose estimation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses masked auto-encoding (MAE) for video
Scales transformer models to 22B parameters
Focuses on 4D spatial-temporal vision tasks
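To make the MAE idea above concrete, here is a minimal sketch of "tube" masking for a video MAE, where the same spatial patches are hidden in every frame and only the small visible remainder is fed to the encoder. The tube-masking variant, the 90% ratio, and all function names are illustrative assumptions, not the paper's exact recipe.

```python
import numpy as np

def tube_mask_indices(num_frames, h_patches, w_patches, mask_ratio, seed=0):
    """Sample a 'tube' mask: mask the same spatial patches in every frame.

    Illustrative only; returns flat token indices (frame-major ordering)
    of the masked tokens.
    """
    rng = np.random.default_rng(seed)
    num_spatial = h_patches * w_patches
    num_masked = int(round(mask_ratio * num_spatial))
    # Choose which spatial patches to hide (without replacement).
    masked_spatial = rng.choice(num_spatial, size=num_masked, replace=False)
    # Broadcast the same spatial mask across all frames.
    frame_offsets = np.arange(num_frames)[:, None] * num_spatial
    return (frame_offsets + masked_spatial[None, :]).ravel()

# Example: 8 frames of a 14x14 patch grid, 90% of patches masked.
idx = tube_mask_indices(num_frames=8, h_patches=14, w_patches=14, mask_ratio=0.9)
visible = 8 * 14 * 14 - len(idx)
print(visible)  # number of tokens the encoder actually sees
```

With a high mask ratio the encoder processes only a small fraction of the tokens, which is what makes scaling MAE pretraining to very large models computationally tractable.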
👥 Authors

João Carreira
Google DeepMind

Dilara Gokay
Google DeepMind
Computer Vision

Michael King
Google DeepMind

Chuhan Zhang
Hong Kong University of Science and Technology
Computer Vision

Ignacio Rocco
Google DeepMind
Computer Vision

Aravindh Mahendran
Google DeepMind
Self-supervised Learning, Object-centric Learning, Computer Vision

T. Keck
Google DeepMind

Joseph Heyward
Google DeepMind

Skanda Koppula
Google DeepMind
Embedded Systems, Computer Vision, Speech Recognition, Bioinformatics

Etienne Pot
Google DeepMind

Goker Erdogan
Google DeepMind

Yana Hasson
Google DeepMind

Yi Yang
Google DeepMind

Klaus Greff
Research Scientist at Google Brain
Machine Learning, Neural Networks

G. Le Moing
Google DeepMind

Sjoerd van Steenkiste
Research Scientist at Google Research
Artificial Intelligence, Machine Learning, Deep Learning

Daniel Zoran
Google DeepMind
Computer Vision, Natural Scene Statistics, Machine Learning, Computational Neuroscience

Drew A. Hudson
Stanford University
Deep Learning, Artificial Intelligence, Reasoning

Pedro Vélez
Google DeepMind

Luisa F. Polanía
Google DeepMind

Luke Friedman
Google DeepMind

Chris Duvarney
Google DeepMind

Ross Goroshin
Google DeepMind
Machine Learning, Feature Learning, Unsupervised Learning

Kelsey Allen
Research Scientist, DeepMind
Artificial Intelligence, Cognitive Science, Computational Neuroscience, Collective Behavior, Physics

Jacob Walker
Google DeepMind

Rishabh Kabra
Google DeepMind, University College London
Unsupervised Learning, Causality, Computer Vision, Reinforcement Learning

E. Aboussouan
Google DeepMind

Jennifer Sun
Google DeepMind

Thomas Kipf
Google DeepMind

Carl Doersch
Google DeepMind
Computer Vision, Machine Learning

Viorica Pătrăucean
Google DeepMind

D. Damen
University of Bristol

Pauline Luc
Google DeepMind
Deep Learning, Generative Modeling, Image Recognition

Mehdi S. M. Sajjadi
Research Scientist, Google DeepMind
Machine Learning, Deep Learning, Generative Models, Computational Imaging

Andrew Zisserman
University of Oxford
Computer Vision, Machine Learning