CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer

📅 2024-08-12
🏛️ arXiv.org
📈 Citations: 261
Influential: 76
📄 PDF
🤖 AI Summary
Existing text-to-video models suffer from monotonous motion, limited video duration (typically ≤2 seconds), and poor narrative coherence. To address these challenges, CogVideoX is an end-to-end framework for high-fidelity 10-second, 16-fps video generation at 768×1360 resolution. The method introduces three core designs: (1) a 3D causal VAE that compresses videos along both spatial and temporal dimensions, improving the compression rate while preserving fidelity; (2) an expert transformer with expert-adaptive LayerNorm, which deepens text–video fusion and improves long-range temporal modeling; and (3) progressive training with a multi-resolution frame-packing strategy that balances visual fidelity and computational cost. Combined with a diffusion transformer backbone, multi-stage data preprocessing, and an automated video-captioning pipeline, the approach achieves state-of-the-art performance on both automatic metrics and human evaluations. All model weights, including the 3D causal VAE and the captioning model, are publicly released.

📝 Abstract
We present CogVideoX, a large-scale text-to-video generation model based on a diffusion transformer, which can generate 10-second continuous videos aligned with a text prompt, at a frame rate of 16 fps and a resolution of 768×1360 pixels. Previous video generation models often had limited movement and short durations, and found it difficult to generate videos with coherent narratives from text. We propose several designs to address these issues. First, we propose a 3D Variational Autoencoder (VAE) that compresses videos along both spatial and temporal dimensions, improving both the compression rate and video fidelity. Second, to improve text–video alignment, we propose an expert transformer with expert adaptive LayerNorm to facilitate deep fusion between the two modalities. Third, by employing progressive training and a multi-resolution frame-pack technique, CogVideoX is adept at producing coherent, long-duration videos of different shapes, characterized by significant motion. In addition, we develop an effective text–video data processing pipeline that includes various data preprocessing strategies and a video captioning method, greatly contributing to generation quality and semantic alignment. Results show that CogVideoX achieves state-of-the-art performance across multiple machine metrics and human evaluations. The model weights of the 3D causal VAE, the video captioning model, and CogVideoX are publicly available at https://github.com/THUDM/CogVideo.
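The "expert adaptive LayerNorm" in the abstract can be pictured as a shared normalization whose scale and shift differ per modality, so text and video tokens are modulated separately inside one transformer. The sketch below is a minimal, illustrative rendering of that idea in plain Python; the function names, dimensions, and the fixed modulation values are assumptions for illustration, not the paper's actual code (in the model, the scale/shift come from the diffusion timestep embedding).

```python
import math

def layer_norm(x, eps=1e-5):
    """Plain LayerNorm over one token vector (a list of floats)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def adaptive_layer_norm(x, scale, shift):
    """Normalize, then apply expert-specific (per-modality) scale and shift."""
    return [(1 + s) * n + b for n, s, b in zip(layer_norm(x), scale, shift)]

# Two "experts": separate modulation parameters for text vs. video tokens.
# The constant values below are placeholders; in the model they are
# predicted from the timestep embedding.
text_scale, text_shift = [0.1] * 4, [0.0] * 4
video_scale, video_shift = [0.3] * 4, [0.05] * 4

text_out = adaptive_layer_norm([1.0, 2.0, 3.0, 4.0], text_scale, text_shift)
video_out = adaptive_layer_norm([4.0, 3.0, 2.0, 1.0], video_scale, video_shift)
```

The point of the design is that both modalities pass through the same attention and MLP weights, while only the cheap normalization parameters are modality-specific.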
Problem

Research questions and friction points this paper is trying to address.

Generating long-duration coherent videos from text prompts
Improving text-video alignment with expert transformer design
Enhancing video fidelity and motion with 3D VAE compression
Innovation

Methods, ideas, or system contributions that make the work stand out.

3D VAE compresses videos spatially and temporally
Expert transformer enhances text-video alignment
Progressive training enables coherent long-duration videos
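To see why the 3D VAE matters, a back-of-envelope count shows how spatiotemporal compression shrinks the sequence the diffusion transformer must model. The 4× temporal, 8×8 spatial, and 2×2 patch factors below are illustrative assumptions, not figures quoted from this summary.

```python
def latent_tokens(frames, height, width, t_down=4, s_down=8, patch=2):
    """Token count after assumed 3D VAE compression and 2x2 patchifying."""
    lat_t = frames // t_down            # temporal compression
    lat_h = height // s_down            # spatial compression
    lat_w = width // s_down
    return lat_t * (lat_h // patch) * (lat_w // patch)

# A 10-second, 16-fps clip at 768x1360 is 160 raw frames; after the
# assumed compression it becomes a far shorter transformer sequence.
tokens = latent_tokens(160, 768, 1360)
```

Without temporal compression (`t_down=1`) the token count would be four times larger, which is the practical motivation for compressing along time as well as space.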