SemanticGen: Video Generation in Semantic Space

📅 2025-12-23
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing video generation models directly model pixel-level tokens in the VAE latent space, resulting in slow convergence and high computational overhead for long-video training. To address this, we propose a novel semantic-space video generation paradigm: (1) In the first stage, a diffusion model operates within a learnable, high-level semantic feature space—rather than the raw VAE latent space—to generate compact, structured semantic plans; (2) In the second stage, VAE latent variables are synthesized conditionally on these semantic plans and subsequently decoded into video frames, enabling hierarchical, semantically grounded pixel reconstruction. This is the first video generation framework to apply diffusion modeling in a learnable semantic feature space. Experiments demonstrate state-of-the-art performance across multiple benchmarks, with significant improvements in visual quality, ~40% faster training convergence, and efficient long-video synthesis.

Technology Category

Application Category

📝 Abstract
State-of-the-art video generative models typically learn the distribution of video latents in the VAE space and map them to pixels using a VAE decoder. While this approach can generate high-quality videos, it suffers from slow convergence and is computationally expensive when generating long videos. In this paper, we introduce SemanticGen, a novel solution to address these limitations by generating videos in the semantic space. Our main insight is that, due to the inherent redundancy in videos, the generation process should begin in a compact, high-level semantic space for global planning, followed by the addition of high-frequency details, rather than directly modeling a vast set of low-level video tokens using bi-directional attention. SemanticGen adopts a two-stage generation process. In the first stage, a diffusion model generates compact semantic video features, which define the global layout of the video. In the second stage, another diffusion model generates VAE latents conditioned on these semantic features to produce the final output. We observe that generation in the semantic space leads to faster convergence compared to the VAE latent space. Our method is also effective and computationally efficient when extended to long video generation. Extensive experiments demonstrate that SemanticGen produces high-quality videos and outperforms state-of-the-art approaches and strong baselines.
Problem

Research questions and friction points this paper is trying to address.

Generates videos in semantic space for efficiency
Improves convergence speed over VAE latent methods
Enables computationally efficient long video generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Two-stage diffusion model for video generation
Generates in semantic space for global planning
Conditional VAE latents for high-frequency details
🔎 Similar Papers
No similar papers found.