Learning from Next-Frame Prediction: Autoregressive Video Modeling Encodes Effective Representations

📅 2025-12-24
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing autoregressive (AR) vision pretraining methods face three key bottlenecks in video modeling: weak temporal modeling, inaccurate semantic localization, and poor generation quality. To address these, we propose NExT-Vid, the first framework to jointly pretrain on images and videos via masked next-frame autoregression. Its core innovations are: (1) a context-isolated autoregressive predictor that decouples semantic representation learning from pixel-level reconstruction; and (2) a conditioned flow-matching decoder that improves generative diversity and fidelity. Extensive large-scale pretraining shows that NExT-Vid consistently outperforms BERT-style and state-of-the-art AR vision models across multiple downstream video classification benchmarks. These results validate its unified representational capability, achieving strong generalization, high discriminability, and high-fidelity generation simultaneously.

📝 Abstract
Recent advances in pretraining general foundation models have significantly improved performance across diverse downstream tasks. While autoregressive (AR) generative models like GPT have revolutionized NLP, most visual generative pretraining methods still rely on BERT-style masked modeling, which often disregards the temporal information essential for video analysis. The few existing autoregressive visual pretraining methods suffer from inaccurate semantic localization and poor generation quality, which in turn yield weak semantic representations. In this work, we propose NExT-Vid, a novel autoregressive visual generative pretraining framework that uses masked next-frame prediction to jointly model images and videos. NExT-Vid introduces a context-isolated autoregressive predictor to decouple semantic representation from target decoding, and a conditioned flow-matching decoder to enhance generation quality and diversity. Through context-isolated flow-matching pretraining, our approach learns strong representations. Extensive experiments on large-scale pretrained models demonstrate that the proposed method consistently outperforms previous generative pretraining methods for visual representation learning, as measured by attentive probing on downstream classification.
Problem

Research questions and friction points this paper is trying to address.

Autoregressive video modeling encodes effective temporal representations
Existing autoregressive visual pretraining suffers from inaccurate semantic localization and poor generation quality
Proposes NExT-Vid for masked next-frame prediction to enhance representation learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Autoregressive masked next-frame prediction for video modeling
Context-isolated predictor decouples semantics from decoding
Conditioned flow-matching decoder enhances generation quality and diversity
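The conditioned flow-matching objective underlying the decoder can be sketched as follows. This is a minimal NumPy illustration under standard flow-matching assumptions (linear interpolation path between noise and target, velocity target x1 − x0); the `toy_predictor` is a hypothetical stand-in linear map, not the paper's architecture, and all names here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def flow_matching_loss(predict_velocity, x1, context, rng):
    """Conditional flow-matching loss for one batch of target frames.

    x1      : (B, D) target-frame features
    context : (B, C) conditioning vectors (e.g. from an AR predictor)
    """
    x0 = rng.standard_normal(x1.shape)          # noise sample
    t = rng.uniform(size=(x1.shape[0], 1))      # per-example time in [0, 1]
    xt = (1.0 - t) * x0 + t * x1                # point on the linear path
    v_target = x1 - x0                          # constant velocity along the path
    v_pred = predict_velocity(xt, t, context)   # model's velocity estimate
    return float(np.mean((v_pred - v_target) ** 2))  # regress to target field

# Hypothetical stand-in "model": a fixed random linear map over [xt, t, context].
W = rng.standard_normal((8 + 1 + 4, 8)) * 0.1
def toy_predictor(xt, t, context):
    inp = np.concatenate([xt, t, context], axis=1)
    return inp @ W

x1 = rng.standard_normal((16, 8))    # fake target-frame features
ctx = rng.standard_normal((16, 4))   # fake context embeddings
loss = flow_matching_loss(toy_predictor, x1, ctx, rng)
```

At inference, a decoder trained this way generates a frame by integrating the learned velocity field from noise to data, with the context vector steering generation toward the predicted next frame.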
Jinghan Li
University of Science and Technology of China
Yang Jin
Peking University
Hao Jiang
Peking University
Yadong Mu
Peking University
Computer Vision · Robotics · Machine Learning
Yang Song
Peking University
Kun Xu
Peking University