Video-GPT via Next Clip Diffusion

πŸ“… 2025-05-18
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
To address the challenge of precisely modeling spatiotemporal dynamics in videos using natural language, this paper proposes Next-Clip Diffusion, a novel autoregressive pretraining paradigm that treats video as a visual language. Unlike conventional token-level modeling, the approach unifies long-horizon spatiotemporal modeling and short-horizon generation via cross-clip noise prediction, ensuring temporal coherence and spatial consistency. Technically, it integrates diffusion modeling, a video-specific autoregressive architecture, and multi-task adaptive fine-tuning. On the Physics-IQ video prediction benchmark, it achieves 34.97 (state of the art), substantially outperforming Kling (23.64) and Wan (20.89). Moreover, the learned representations generalize effectively across six diverse downstream video generation and understanding tasks, demonstrating strong universal representational capability.

πŸ“ Abstract
GPT has shown remarkable success in natural language processing. However, language sequences are not sufficient to describe spatial-temporal details of the visual world; video sequences, by contrast, capture such details well. Motivated by this fact, we propose a concise Video-GPT by treating video as a new language for visual world modeling. By analogy to next-token prediction in GPT, we introduce a novel next-clip diffusion paradigm for pretraining Video-GPT. Unlike previous works, this paradigm allows Video-GPT to tackle both short-term generation and long-term prediction by autoregressively denoising a noisy clip conditioned on the clean clips in its history. Extensive experiments show that Video-GPT achieves state-of-the-art performance on video prediction, a key factor toward world modeling (Physics-IQ Benchmark: Video-GPT 34.97 vs. Kling 23.64 vs. Wan 20.89). Moreover, it adapts well to 6 mainstream video tasks in both video generation and understanding, showing strong generalization capacity on downstream tasks. The project page is at https://Video-GPT.github.io.
Problem

Research questions and friction points this paper is trying to address.

Modeling the visual world using video as a new language
Enabling both short-term generation and long-term video prediction
Achieving state-of-the-art performance on video prediction and related tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Video-GPT treats video as a new language for visual world modeling
Next-clip diffusion paradigm for pretraining Video-GPT
Autoregressive denoising for both short-term generation and long-term prediction
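The next-clip idea above can be sketched as a toy diffusion loop: each new clip is produced by adding noise and then denoising it conditioned on the clean clips generated so far. This is a minimal illustration under assumptions, not the paper's implementation; the linear noise schedule, array shapes, and all function names (`add_noise`, `denoise`, `rollout`) are hypothetical.

```python
import numpy as np

def add_noise(clip, t, num_steps, rng):
    """Forward process: mix the clean clip with Gaussian noise (toy linear schedule)."""
    alpha = 1.0 - t / num_steps
    noise = rng.standard_normal(clip.shape)
    noisy = alpha * clip + (1.0 - alpha) * noise
    return noisy, noise, alpha

def denoise(noisy, pred_noise, alpha):
    """Invert the toy forward process given a predicted noise term."""
    return (noisy - (1.0 - alpha) * pred_noise) / alpha

def rollout(first_clip, model, n_clips, t, num_steps, rng):
    """Autoregressive next-clip generation: the model predicts the noise in the
    next (noisy) clip conditioned on the clean clips in the history, and the
    denoised result is appended to the history for the following step."""
    history = [first_clip]
    for _ in range(n_clips):
        noisy, _, alpha = add_noise(history[-1], t, num_steps, rng)
        pred = model(history, noisy)   # clean history + noisy next clip in
        history.append(denoise(noisy, pred, alpha))
    return history
```

With a perfect noise predictor, `denoise` exactly recovers the clean clip, which is the sense in which short-term generation and long-term prediction share one objective: every step is the same conditional denoising problem, only the length of the clean history changes.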
πŸ”Ž Similar Papers
No similar papers found.
Shaobin Zhuang
Shanghai Jiao Tong University
Video Generation · Computer Vision
Zhipeng Huang
Microsoft Research Asia and University of Science and Technology of China
Multi-Modality · Computer Vision
Ying Zhang
WeChat Vision, Tencent Inc.
Fangyikang Wang
Zhejiang University
Diffusion Models · Optimal Transport · Optimization
Canmiao Fu
WeChat Vision, Tencent Inc.
Binxin Yang
WeChat Vision, Tencent Inc.
Chong Sun
Tencent WeChat
Computer Vision
Chen Li
WeChat Vision, Tencent Inc.
Yali Wang
Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences; Shanghai AI Laboratory