Grid Diffusion Models for Text-to-Video Generation

📅 2024-03-30
🏛️ Computer Vision and Pattern Recognition
📈 Citations: 2
Influential: 0
🤖 AI Summary
Existing text-to-video generation methods rely on large-scale datasets and substantial computational resources, typically modeling the temporal dimension with 3D U-Net architectures or autoregressive generation. This paper proposes a grid diffusion model that represents a video as a single grid-shaped image, reducing video generation to 2D image diffusion with no temporal dimension in the architecture and no need for a large text-video paired dataset. Because the grid representation has the dimensionality of an image, the model generates high-quality video with a fixed amount of GPU memory regardless of the number of frames, and image-based techniques such as text-guided image manipulation carry over directly to video. The method outperforms existing approaches in both quantitative and qualitative evaluations.

📝 Abstract
Recent advances in diffusion models have significantly improved text-to-image generation. However, generating videos from text is a more challenging task than generating images from text, due to the much larger dataset and higher computational cost required. Most existing video generation methods use either a 3D U-Net architecture that considers the temporal dimension or autoregressive generation. These methods require large datasets and incur much higher computational costs than text-to-image generation. To tackle these challenges, we propose a simple but effective novel grid diffusion for text-to-video generation without a temporal dimension in the architecture or a large text-video paired dataset. We can generate a high-quality video using a fixed amount of GPU memory regardless of the number of frames by representing the video as a grid image. Additionally, since our method reduces the dimensions of the video to the dimensions of the image, various image-based methods can be applied to videos, such as text-guided video manipulation derived from image manipulation. Our proposed method outperforms the existing methods in both quantitative and qualitative evaluations, demonstrating the suitability of our model for real-world video generation.
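The core trick in the abstract is treating a video as a single grid image so a 2D model can process it. The paper does not spell out the exact layout, but the idea can be sketched with a simple round-trip between a stack of frames and a tiled grid (the `rows`/`cols` layout and function names here are illustrative assumptions, not the paper's implementation):

```python
import numpy as np

def frames_to_grid(frames: np.ndarray, rows: int, cols: int) -> np.ndarray:
    """Tile T frames of shape (T, H, W, C) into one (rows*H, cols*W, C) grid image.

    Illustrative sketch: assumes frames exactly fill a rows x cols grid,
    laid out row-major (frame 0 at top-left).
    """
    t, h, w, c = frames.shape
    assert t == rows * cols, "frame count must fill the grid exactly"
    grid = frames.reshape(rows, cols, h, w, c)
    # Interleave grid rows with pixel rows, then flatten to a single image.
    return grid.transpose(0, 2, 1, 3, 4).reshape(rows * h, cols * w, c)

def grid_to_frames(grid: np.ndarray, rows: int, cols: int) -> np.ndarray:
    """Inverse mapping: split a grid image back into (rows*cols, H, W, C) frames."""
    gh, gw, c = grid.shape
    h, w = gh // rows, gw // cols
    frames = grid.reshape(rows, h, cols, w, c).transpose(0, 2, 1, 3, 4)
    return frames.reshape(rows * cols, h, w, c)
```

Under this layout the grid image has a fixed resolution for a fixed grid size, which is why memory cost stays constant however many frames the grid encodes, and any image-space operation applied to the grid acts on all frames at once.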
Problem

Research questions and friction points this paper is trying to address.

Text-to-video generation
Data-intensive
Temporal dimension challenges
Innovation

Methods, ideas, or system contributions that make the work stand out.

Grid Diffusion Model
Text-to-Video Generation
Memory-efficient Processing