Evolution of Video Generative Foundations

📅 2026-04-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses a critical gap in existing video generation surveys, which predominantly focus on specific techniques—such as GANs or diffusion models—or narrow tasks like video editing, while overlooking the broader evolutionary trajectory, particularly the systematic integration of autoregressive modeling and multimodal fusion. To bridge this gap, this work presents the first comprehensive synthesis of technical advancements spanning generative adversarial networks, diffusion models, and emerging autoregressive and multimodal approaches. It offers an in-depth analysis of their underlying principles, strengths, and limitations, and establishes a unified analytical framework that connects historical foundations with cutting-edge developments. Emphasizing autoregressive architectures and multimodal information integration as pivotal emerging directions, this research provides theoretical grounding for applications in virtual reality, personalized education, and autonomous driving simulation, thereby advancing the development of sophisticated world models and digital content generation.
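As a concrete reference for the autoregressive framing emphasized above, the factorization it rests on can be written in one line; the notation here (token sequence, conditioning variable) is an illustrative choice of ours, not the survey's:

```latex
% A video is discretized into an ordered token sequence x_1, ..., x_T
% (e.g., patch tokens from a learned visual tokenizer); generation then
% proceeds token by token, conditioned on everything produced so far.
p(x_1, \dots, x_T \mid c) = \prod_{t=1}^{T} p\left(x_t \mid x_{<t},\, c\right)
% c denotes optional multimodal conditioning (text, audio, camera pose, ...).
```

This sequential view is what lets AR video models inherit the scaling and multimodal-fusion machinery developed for language models, which is why the summary singles it out as an emerging direction.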
📝 Abstract
The rapid advancement of Artificial Intelligence Generated Content (AIGC) has revolutionized video generation, enabling systems ranging from proprietary pioneers like OpenAI's Sora, Google's Veo 3, and ByteDance's Seedance to powerful open-source contenders like Wan and HunyuanVideo to synthesize temporally coherent and semantically rich videos. These advancements pave the way for building "world models" that simulate real-world dynamics, with applications spanning entertainment, education, and virtual reality. However, existing reviews of video generation often focus on narrow technical fields, e.g., Generative Adversarial Networks (GANs) and diffusion models, or on specific tasks (e.g., video editing), lacking a comprehensive perspective on the field's evolution, especially regarding Auto-Regressive (AR) models and the integration of multimodal information. To address these gaps, this survey first provides a systematic review of the development of video generation technology, tracing its evolution from early GANs to the now-dominant diffusion models, and further to emerging AR-based and multimodal techniques. We conduct an in-depth analysis of the foundational principles, key advancements, and comparative strengths and limitations of each paradigm. We then explore emerging trends in multimodal video generation, emphasizing the integration of diverse data types to enhance contextual awareness. Finally, by bridging historical developments and contemporary innovations, this survey offers insights to guide future research in this rapidly evolving field and its applications, including virtual/augmented reality, personalized education, autonomous driving simulation, digital entertainment, and advanced world models. For more details, please refer to the project at https://github.com/sjtuplayer/Awesome-Video-Foundations.
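For orientation on the diffusion side, below is a minimal toy sketch of the ancestral-sampling loop that DDPM-style generators build on; `denoise_fn`, the linear noise schedule, and the tiny tensor shape are illustrative placeholders, not the implementation of any model named in the abstract:

```python
import numpy as np

def ddpm_sample(denoise_fn, shape, num_steps=50, seed=0):
    """Toy DDPM ancestral sampler: start from Gaussian noise and
    iteratively denoise. `denoise_fn(x, t)` is assumed to predict the
    noise component at step t (a stand-in for a trained video denoiser
    operating on a (frames, height, width, channels) array).
    """
    rng = np.random.default_rng(seed)
    betas = np.linspace(1e-4, 0.02, num_steps)   # linear noise schedule
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)

    x = rng.standard_normal(shape)               # x_T ~ N(0, I)
    for t in reversed(range(num_steps)):
        eps = denoise_fn(x, t)                   # predicted noise
        # Posterior mean of x_{t-1} given x_t (standard DDPM update rule).
        x = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
        if t > 0:
            # Re-inject noise at every step except the last.
            x += np.sqrt(betas[t]) * rng.standard_normal(shape)
    return x

# Usage with a dummy denoiser on a tiny 4-frame "video":
video = ddpm_sample(lambda x, t: np.zeros_like(x), shape=(4, 16, 16, 3))
```

Real video diffusion systems replace the dummy denoiser with a large spatiotemporal network and typically operate in a compressed latent space, but the update rule above is the core loop they share.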
Problem

Research questions and friction points this paper is trying to address.

video generation, survey, auto-regressive models, multimodal integration, AIGC
Innovation

Methods, ideas, or system contributions that make the work stand out.

video generation, auto-regressive models, multimodal integration, diffusion models, world models