🤖 AI Summary
This work investigates the efficacy of autoregressive modeling for general-purpose visual representation learning. We propose a purely autoregressive pretraining paradigm: videos and images are uniformly tokenized into visual sequences via VQVAE, and a Transformer is trained to predict future visual tokens over a mixed dataset exceeding one trillion tokens. This constitutes a systematic validation of the feasibility of purely autoregressive, minimal-inductive-bias architectures in vision. We observe language-model-like scaling behavior, albeit at a different rate. Our method integrates vector-quantized tokenization, cross-modal joint pretraining, and long-sequence modeling. On downstream tasks, including image classification, video action recognition, object tracking, and robotic perception, it achieves competitive or state-of-the-art performance. These results establish a pathway toward unified, scalable, general-purpose visual models.
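The tokenization step described above maps continuous patch embeddings to discrete codebook indices. As a minimal illustration (not the paper's VQVAE implementation; the function name, toy dimensions, and plain numpy nearest-neighbor lookup are assumptions for exposition), vector quantization assigns each patch the index of its nearest codebook vector:

```python
import numpy as np

def vq_tokenize(patches, codebook):
    """Map each patch embedding to the index of its nearest codebook
    vector; these indices are the discrete 'visual tokens' fed to the
    autoregressive Transformer.

    patches:  (N, D) array of patch embeddings
    codebook: (K, D) array of learned codebook vectors
    returns:  (N,) array of integer token ids in [0, K)
    """
    # squared L2 distance between every patch and every codebook entry
    dists = ((patches[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)

# toy example: 3-entry codebook in 2-D, three noisy patches
codebook = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
patches = np.array([[0.9, 0.1], [0.05, 0.02], [0.1, 0.9]])
tokens = vq_tokenize(patches, codebook)  # each patch snaps to its nearest code
```

In the actual pipeline the codebook is learned jointly with an encoder/decoder, but the discretization itself is exactly this nearest-neighbor assignment.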
📝 Abstract
We empirically study autoregressive pre-training from videos. To perform our study, we construct a series of autoregressive video models, called Toto. We treat videos as sequences of visual tokens and train transformer models to autoregressively predict future tokens. Our models are pre-trained on a diverse dataset of videos and images comprising over 1 trillion visual tokens. We explore different architectural, training, and inference design choices. We evaluate the learned visual representations on a range of downstream tasks including image recognition, video classification, object tracking, and robotics. Our results demonstrate that, despite minimal inductive biases, autoregressive pre-training leads to competitive performance across all benchmarks. Finally, we find that scaling our video models yields scaling curves similar to those seen in language models, albeit with a different rate. More details are available at https://brjathu.github.io/toto/
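The training objective above is standard next-token prediction: at each position the model is scored on how well it predicts the following visual token. A minimal numpy sketch of that loss (a toy stand-in, not the paper's training code; the function name, vocabulary size, and random logits are illustrative assumptions):

```python
import numpy as np

def next_token_xent(logits, tokens):
    """Average cross-entropy of predicting token t+1 from position t.

    logits: (T, V) per-position model outputs over a V-token vocabulary
    tokens: (T,)  integer visual-token ids
    """
    # shift: position t predicts token t+1, so drop the last logit row
    # and the first token
    logits, targets = logits[:-1], tokens[1:]
    # log-softmax with max-subtraction for numerical stability
    z = logits - logits.max(axis=1, keepdims=True)
    logp = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -logp[np.arange(len(targets)), targets].mean()

# toy example: an 8-token "visual vocabulary" and a length-6 sequence
rng = np.random.default_rng(0)
tokens = rng.integers(0, 8, size=6)
logits = rng.normal(size=(6, 8))  # stand-in for transformer outputs
loss = next_token_xent(logits, tokens)
```

In practice the logits come from a causal transformer over the VQ token sequence, but the loss being minimized is exactly this shifted cross-entropy.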