Scaling 4D Representations

📅 2024-12-19

🏛️ arXiv.org

📈 Citations: 5

✨ Influential: 0

career value

215K/year

🤖 AI Summary

Existing purely self-supervised video learning approaches lack systematic validation of scalability for non-semantic 4D vision tasks—such as camera pose estimation, point/object tracking, and depth estimation. This paper introduces a Transformer-based video model grounded in masked autoencoding (MAE), trained on large-scale video data across a rigorously controlled, multi-scale ablation framework. We scale model parameters from 20M to 22B—the largest purely self-supervised video models to date. For the first time, we empirically demonstrate strong scalability of such representations for 4D tasks: performance improves consistently with model size. Our approach significantly outperforms prior image- and video-based self-supervised methods on multiple non-semantic benchmarks, establishing new state-of-the-art results in camera pose estimation, tracking, and depth estimation.

Technology Category

Application Category

📝 Abstract

Scaling has not yet been convincingly demonstrated for pure self-supervised learning from video. However, prior work has focused evaluations on semantic-related tasks $unicode{x2013}$ action classification, ImageNet classification, etc. In this paper we focus on evaluating self-supervised learning on non-semantic vision tasks that are more spatial (3D) and temporal (+1D = 4D), such as camera pose estimation, point and object tracking, and depth estimation. We show that by learning from very large video datasets, masked auto-encoding (MAE) with transformer video models actually scales, consistently improving performance on these 4D tasks, as model size increases from 20M all the way to the largest by far reported self-supervised video model $unicode{x2013}$ 22B parameters. Rigorous apples-to-apples comparison with many recent image and video models demonstrates the benefits of scaling 4D representations.

Problem

Research questions and friction points this paper is trying to address.

Evaluating self-supervised learning on non-semantic 4D vision tasks

Scaling masked auto-encoding with transformer video models

Improving performance on spatial-temporal tasks like pose estimation

Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses masked auto-encoding (MAE) for video

Scales transformer models to 22B parameters

Focuses on 4D spatial-temporal vision tasks

🔎 Similar Papers

No similar papers found.