🤖 AI Summary
Existing autoregressive large language models (LLMs) for video generation are severely limited in duration, typically producing only a few seconds of video, and thus fail to capture the temporal consistency and fine-grained motion coherence required for minute-long videos. To address this, we propose the first autoregressive long-video generation framework based on a unified text-video token sequence. Our approach comprises three key innovations: (1) constructing a cross-modal unified tokenization space to jointly model text and video; (2) introducing a progressive short-to-long training strategy with dynamic loss re-weighting to alleviate the challenges of long-range dependency modeling; and (3) incorporating video token re-encoding and optimized sampling (temperature scaling + top-k) to suppress error accumulation during autoregressive inference. Remarkably, although trained solely on 10-second video clips, our model generates high-fidelity, semantically aligned, and temporally consistent 60-second videos, significantly outperforming existing baselines in text-video alignment and motion coherence.
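The summary's third point combines temperature scaling with top-k truncation when sampling each next video token. A minimal sketch of that standard sampling combination is below; the function name, the NumPy implementation, and the `temperature`/`top_k` defaults are illustrative assumptions, not the paper's reported settings.

```python
import numpy as np

def sample_next_token(logits, temperature=0.9, top_k=50, rng=None):
    """Sample one token id from raw logits via temperature scaling
    followed by top-k truncation.

    Note: default values here are hypothetical, not taken from the paper.
    """
    rng = rng or np.random.default_rng()
    scaled = np.asarray(logits, dtype=float) / temperature  # <1 sharpens, >1 flattens
    top_k = min(top_k, scaled.size)
    # Keep only the k highest-scoring token ids.
    top_idx = np.argpartition(scaled, -top_k)[-top_k:]
    top_logits = scaled[top_idx]
    # Softmax over the surviving logits (shifted for numerical stability).
    probs = np.exp(top_logits - top_logits.max())
    probs /= probs.sum()
    return int(rng.choice(top_idx, p=probs))
```

Truncating to the top-k tokens discards the low-probability tail, which is where a single bad draw can start the error-accumulation spiral in autoregressive decoding; temperature below 1 further concentrates mass on high-confidence tokens.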
📝 Abstract
It is desirable but challenging to generate content-rich long videos on the scale of minutes. Autoregressive large language models (LLMs) have achieved great success in generating coherent, long token sequences in natural language processing, whereas their exploration for video generation has been limited to short videos of several seconds. In this work, we conduct a deep analysis of the challenges that prevent autoregressive LLM-based video generators from generating long videos. Based on these observations and analysis, we propose Loong, a new autoregressive LLM-based video generator that can generate minute-long videos. Specifically, we model text tokens and video tokens as a unified sequence for autoregressive LLMs and train the model from scratch. We propose progressive short-to-long training with a loss re-weighting scheme to mitigate the loss imbalance problem in long-video training. We further investigate inference strategies, including video token re-encoding and sampling strategies, to diminish error accumulation during inference. Our proposed Loong can be trained on 10-second videos and extended to generate minute-level long videos conditioned on text prompts, as demonstrated by our results. More samples are available at: https://yuqingwang1029.github.io/Loong-video.
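The abstract's loss re-weighting scheme addresses a loss imbalance across a long video's frames. The abstract does not give the formula, so the sketch below only illustrates the general shape of per-frame re-weighting: each token's cross-entropy loss is scaled by a weight depending on its frame index before averaging. The linear ramp used as the default `weight_fn` is a hypothetical choice, not the paper's scheme.

```python
import numpy as np

def reweighted_loss(token_losses, frame_ids, weight_fn=None):
    """Aggregate per-token losses with per-frame weights.

    token_losses: per-token loss values (e.g. cross-entropy).
    frame_ids:    frame index of each token in the video sequence.
    weight_fn:    maps frame index -> scalar weight; the default linear
                  ramp (emphasizing later frames) is a hypothetical
                  example, not the paper's actual re-weighting.
    """
    token_losses = np.asarray(token_losses, dtype=float)
    frame_ids = np.asarray(frame_ids)
    if weight_fn is None:
        weight_fn = lambda f: 1.0 + 0.1 * f
    weights = np.array([weight_fn(f) for f in frame_ids], dtype=float)
    # Weighted mean: frames with larger weights contribute more gradient.
    return float((weights * token_losses).sum() / weights.sum())
```

With a uniform `weight_fn` this reduces to the plain mean loss; a non-uniform choice shifts the training signal toward the frames whose per-token loss would otherwise be under-represented.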