🤖 AI Summary
This work investigates the efficacy of autoregressive modeling for general-purpose visual representation learning. We propose a purely autoregressive pretraining paradigm: videos and images are uniformly tokenized into visual sequences via VQVAE, and a Transformer is trained to predict future visual tokens over a mixed dataset exceeding one trillion tokens. This constitutes a systematic validation of the feasibility of purely autoregressive, minimal-inductive-bias architectures in vision. We observe language-model-like scaling behavior, albeit at a different rate. Our method integrates vector-quantized tokenization, cross-modal joint pretraining, and long-sequence modeling. On downstream tasks, including image classification, video action recognition, object tracking, and robotic perception, it achieves competitive or state-of-the-art performance. These results establish a pathway toward unified, scalable, general-purpose visual models.
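The tokenization step described above maps continuous patch embeddings to discrete codebook indices. As a minimal illustration (not the paper's VQVAE implementation; the function name, toy dimensions, and plain numpy nearest-neighbor lookup are assumptions for exposition), vector quantization assigns each patch the index of its nearest codebook vector:

```python
import numpy as np

def vq_tokenize(patches, codebook):
    """Map each patch embedding to the index of its nearest codebook
    vector; these indices are the discrete 'visual tokens' fed to the
    autoregressive Transformer.

    patches:  (N, D) array of patch embeddings
    codebook: (K, D) array of learned codebook vectors
    returns:  (N,) array of integer token ids in [0, K)
    """
    # squared L2 distance between every patch and every codebook entry
    dists = ((patches[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)

# toy example: 3-entry codebook in 2-D, three noisy patches
codebook = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
patches = np.array([[0.9, 0.1], [0.05, 0.02], [0.1, 0.9]])
tokens = vq_tokenize(patches, codebook)  # each patch snaps to its nearest code
```

In the actual pipeline the codebook is learned jointly with an encoder/decoder, but the discretization itself is exactly this nearest-neighbor assignment.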
📝 Abstract
We empirically study autoregressive pre-training from videos. To perform our study, we construct a series of autoregressive video models, called Toto. We treat videos as sequences of visual tokens and train transformer models to autoregressively predict future tokens. Our models are pre-trained on a diverse dataset of videos and images comprising over 1 trillion visual tokens. We explore different architectural, training, and inference design choices. We evaluate the learned visual representations on a range of downstream tasks including image recognition, video classification, object tracking, and robotics. Our results demonstrate that, despite minimal inductive biases, autoregressive pre-training leads to competitive performance across all benchmarks. Finally, we find that scaling our video models yields scaling curves similar to those seen in language models, albeit with a different rate. More details are available at https://brjathu.github.io/toto/
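The training objective above is standard next-token prediction: at each position the model is scored on how well it predicts the following visual token. A minimal numpy sketch of that loss (a toy stand-in, not the paper's training code; the function name, vocabulary size, and random logits are illustrative assumptions):

```python
import numpy as np

def next_token_xent(logits, tokens):
    """Average cross-entropy of predicting token t+1 from position t.

    logits: (T, V) per-position model outputs over a V-token vocabulary
    tokens: (T,)  integer visual-token ids
    """
    # shift: position t predicts token t+1, so drop the last logit row
    # and the first token
    logits, targets = logits[:-1], tokens[1:]
    # log-softmax with max-subtraction for numerical stability
    z = logits - logits.max(axis=1, keepdims=True)
    logp = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -logp[np.arange(len(targets)), targets].mean()

# toy example: an 8-token "visual vocabulary" and a length-6 sequence
rng = np.random.default_rng(0)
tokens = rng.integers(0, 8, size=6)
logits = rng.normal(size=(6, 8))  # stand-in for transformer outputs
loss = next_token_xent(logits, tokens)
```

In practice the logits come from a causal transformer over the VQ token sequence, but the loss being minimized is exactly this shifted cross-entropy.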