Advancing Auto-Regressive Continuation for Video Frames

📅 2024-12-04
🏛️ arXiv.org
📈 Citations: 1
✨ Influential: 0
🤖 AI Summary
To address the degradation in long-sequence modeling and declining visual fidelity in autoregressive video frame generation, this paper proposes ARCON, a framework that alternates autoregressive generation of semantic tokens and RGB tokens and empirically shows that the two token streams remain highly consistent without any special design. ARCON further introduces a flow-guided texture stitching mechanism to enhance temporal coherence. The method combines multimodal large-language-model-based tokenization, alternating token-level modeling, optical flow estimation, and texture fusion. Evaluated in autonomous driving scenarios, ARCON stably generates high-fidelity videos spanning over 100 frames and improves on state-of-the-art baselines in both quantitative metrics (FVD and LPIPS) and qualitative visual quality. By unifying semantic and pixel-level modeling with motion-aware synthesis, ARCON offers a new paradigm for world-model construction and future-frame prediction.

πŸ“ Abstract
Recent advances in auto-regressive large language models (LLMs) have shown their potential in generating high-quality text, inspiring researchers to apply them to image and video generation. This paper explores the application of LLMs to video continuation, a task essential for building world models and predicting future frames. In this paper, we tackle challenges including preventing degeneration in long-term frame generation and enhancing the quality of generated images. We design a scheme named ARCON, which involves training our model to alternately generate semantic tokens and RGB tokens, enabling the LLM to explicitly learn and predict the high-level structural information of the video. We find high consistency in the RGB images and semantic maps generated without special design. Moreover, we employ an optical flow-based texture stitching method to enhance the visual quality of the generated videos. Quantitative and qualitative experiments in autonomous driving scenarios demonstrate our model can consistently generate long videos.
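The alternating semantic/RGB token scheme described above can be sketched as a simple interleaved generation loop. This is a minimal, hypothetical illustration of the scheduling idea only; `predict_next_token` is a stand-in stub, not the paper's actual LLM, and the token names are invented for clarity.

```python
# Hypothetical sketch of ARCON-style alternating token generation.
# predict_next_token is an illustrative stub, not the paper's model.

def predict_next_token(context, modality):
    # Stand-in for the autoregressive model: echo a tagged token so the
    # interleaving schedule itself can be demonstrated.
    return f"{modality}-{len(context)}"

def generate_interleaved(num_frames):
    """For each frame, first emit a semantic token (high-level structure),
    then an RGB token conditioned on everything generated so far."""
    sequence = []
    for _ in range(num_frames):
        # Semantic token for the upcoming frame ...
        sequence.append(predict_next_token(sequence, "sem"))
        # ... then the RGB token that fills in pixel content.
        sequence.append(predict_next_token(sequence, "rgb"))
    return sequence

print(generate_interleaved(3))
# ['sem-0', 'rgb-1', 'sem-2', 'rgb-3', 'sem-4', 'rgb-5']
```

The point of the interleaving is that every RGB token is conditioned on a freshly predicted semantic token, which is how the model is pushed to learn high-level structure explicitly.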
Problem

Research questions and friction points this paper is trying to address.

How to apply auto-regressive LLMs to video continuation without degeneration in long-term frame generation
How to make the model learn high-level structure by generating semantic and RGB tokens alternately
How to improve the visual quality of generated frames via optical flow-based texture stitching
Innovation

Methods, ideas, or system contributions that make the work stand out.

ARCON alternates generation of semantic and RGB tokens, which prove highly consistent without special design
Optical flow-based texture stitching enhances visual quality and temporal coherence
Stable generation of long videos (100+ frames) in autonomous driving scenarios
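The optical flow-based texture stitching idea can be illustrated as warping the previous frame's texture along a flow field and blending it with the newly generated frame. This is an assumed sketch only: the backward-warp scheme, the uniform blend weight `alpha`, and all names here are illustrative, not ARCON's exact formulation.

```python
import numpy as np

# Illustrative flow-guided texture stitching (assumed formulation, not
# ARCON's exact method): warp real texture from the previous frame along
# an optical-flow field, then blend it into the generated frame.

def warp(prev_frame, flow):
    """Backward-warp prev_frame: output pixel (y, x) samples the previous
    frame at (y - dy, x - dx), clamped to the image border."""
    h, w = prev_frame.shape[:2]
    out = np.empty_like(prev_frame)
    for y in range(h):
        for x in range(w):
            dy, dx = flow[y, x]
            sy = min(max(int(round(y - dy)), 0), h - 1)
            sx = min(max(int(round(x - dx)), 0), w - 1)
            out[y, x] = prev_frame[sy, sx]
    return out

def stitch(generated, prev_frame, flow, alpha=0.5):
    """Blend warped real texture into the generated frame; alpha is an
    assumed uniform blend weight."""
    warped = warp(prev_frame, flow)
    return alpha * warped + (1.0 - alpha) * generated

# Tiny demo: a 2x2 grayscale pair with uniform 1-pixel rightward motion.
prev = np.array([[10.0, 20.0], [30.0, 40.0]])
gen = np.array([[0.0, 0.0], [0.0, 0.0]])
flow = np.full((2, 2, 2), (0.0, 1.0))  # (dy, dx) = (0, 1) everywhere
print(stitch(gen, prev, flow, alpha=1.0))
# each row shifts right by one pixel: [[10, 10], [30, 30]]
```

With `alpha=1.0` the output is pure warped texture; with `alpha=0.0` it is the generated frame unchanged, so the weight trades sharp copied texture against the model's own prediction.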
👥 Authors
Ruibo Ming
Tsinghua University, Megvii Technology
Jingwei Wu
University of the Chinese Academy of Sciences
Zhewei Huang
University of the Chinese Academy of Sciences
Zhuoxuan Ju
Georgetown University
Jianming Hu
Penn State University
Virology, Molecular Biology
Lihui Peng
Tsinghua University
Shuchang Zhou
Megvii Inc.
Artificial Intelligence