Show-1: Marrying Pixel and Latent Diffusion Models for Text-to-Video Generation

📅 2023-09-27
🏛️ International Journal of Computer Vision
📈 Citations: 216 (influential: 11)
🤖 AI Summary
Existing text-to-video generation methods face a fundamental trade-off: pixel-space diffusion models incur prohibitive computational costs (72 GB of GPU memory at inference), whereas latent-space diffusion models struggle to maintain precise text-video alignment. This paper introduces Show-1, the first hybrid framework that unifies pixel-space and latent-space video diffusion models (VDMs). It first generates a low-resolution video with strong semantic alignment using a pixel-space VDM, then applies a novel "expert translation" method that drives a latent-space VDM to upsample the result and refine detail. Key contributions: (1) a dual-domain (pixel + latent) collaborative architecture; (2) an expert-translation upsampling paradigm that bridges the two representation spaces; and (3) motion customization and video stylization via fine-tuning only the temporal attention layers. Show-1 achieves state-of-the-art performance on standard benchmarks while reducing inference memory consumption to 15 GB, balancing high visual fidelity with accurate text-video alignment.
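At a high level, the summary describes a two-stage pipeline: a pixel-space VDM handles text-aligned low-resolution generation, and a latent-space VDM handles upsampling. A minimal sketch of that flow, where `pixel_vdm`, `latent_vdm`, and their `sample`/`upsample` methods are hypothetical placeholders rather than the released API:

```python
# Hedged sketch of the two-stage Show-1 flow described above.
# pixel_vdm / latent_vdm and their methods are hypothetical placeholders,
# not the interfaces of the released code.

def generate_video(prompt, pixel_vdm, latent_vdm, low_res, high_res,
                   num_frames=16):
    # Stage 1: pixel-space generation. Conditioning on text directly in
    # pixel space is what gives the strong text-video alignment.
    low_res_video = pixel_vdm.sample(prompt, size=low_res, frames=num_frames)

    # Stage 2: expert translation. The latent-space VDM upsamples the
    # low-res result and removes artifacts at a fraction of the memory
    # cost of running a pixel-space VDM at full resolution.
    return latent_vdm.upsample(low_res_video, prompt, size=high_res)
```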
📝 Abstract
Significant advancements have been achieved in the realm of large-scale pre-trained text-to-video Diffusion Models (VDMs). However, previous methods either rely solely on pixel-based VDMs, which come with high computational costs, or on latent-based VDMs, which often struggle with precise text-video alignment. In this paper, we are the first to propose a hybrid model, dubbed Show-1, which marries pixel-based and latent-based VDMs for text-to-video generation. Our model first uses pixel-based VDMs to produce a low-resolution video with strong text-video correlation. After that, we propose a novel expert translation method that employs the latent-based VDMs to further upsample the low-resolution video to high resolution, which can also remove potential artifacts and corruptions from low-resolution videos. Compared to latent VDMs, Show-1 can produce high-quality videos with precise text-video alignment; compared to pixel VDMs, Show-1 is much more efficient (GPU memory usage during inference is 15 GB vs. 72 GB). Furthermore, our Show-1 model can be readily adapted for motion customization and video stylization applications through simple temporal attention layer finetuning. Our model achieves state-of-the-art performance on standard video generation benchmarks. Our code and model weights are publicly available at https://github.com/showlab/Show-1.
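The abstract notes that motion customization and stylization require finetuning only the temporal attention layers. A minimal PyTorch sketch of that kind of selective finetuning, assuming the temporal attention submodules can be identified by a name marker such as `temp_attn` (the actual identifier in the released code may differ):

```python
# Sketch of temporal-attention-only finetuning, as mentioned in the
# abstract. The "temp_attn" name marker is an assumption about how the
# UNet names its temporal attention parameters.

import torch

def freeze_all_but_temporal_attention(unet: torch.nn.Module,
                                      marker: str = "temp_attn"):
    trainable = []
    for name, param in unet.named_parameters():
        # Unfreeze only parameters belonging to temporal attention layers.
        param.requires_grad = marker in name
        if param.requires_grad:
            trainable.append(param)
    return trainable

# Usage: pass only the unfrozen parameters to the optimizer.
# optimizer = torch.optim.AdamW(freeze_all_but_temporal_attention(unet), lr=1e-5)
```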
Problem

Research questions and friction points this paper is trying to address.

Combining pixel and latent diffusion models for efficient text-to-video generation
Improving text-video alignment while reducing computational costs
Enhancing video resolution and quality with artifact removal
Innovation

Methods, ideas, or system contributions that make the work stand out.

Combines pixel-space and latent-space diffusion models in a single pipeline
Uses expert translation for upsampling (see the sketch after this list)
Enables efficient high-resolution video generation
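This listing does not spell out the expert translation procedure; a common way to realize latent-space upsampling of a pixel-space result, sketched here purely as an assumption, is to encode an upscaled copy of the low-resolution video and denoise it from a partially noised latent. All interfaces (`vae`, `latent_vdm`, `scheduler`) are hypothetical:

```python
# Hedged sketch of latent-space upsampling in the spirit of the "expert
# translation" bullet above. The schedule, strength, and model interfaces
# are assumptions, not the authors' released implementation.

import torch
import torch.nn.functional as F

@torch.no_grad()
def latent_upsample(low_res_video, prompt, vae, latent_vdm, scheduler,
                    scale=4, strength=0.6):
    # 1) Naively upscale frames to the target resolution in pixel space.
    #    low_res_video: (num_frames, 3, H, W) tensor.
    frames = F.interpolate(low_res_video, scale_factor=scale,
                           mode="bicubic", align_corners=False)

    # 2) Encode into the latent VDM's working space.
    latents = vae.encode(frames)

    # 3) Noise the latents to an intermediate timestep so the denoiser
    #    refines detail without discarding the text-aligned content.
    t_start = int(strength * scheduler.num_steps)
    noise = torch.randn_like(latents)
    latents = scheduler.add_noise(latents, noise, t_start)

    # 4) Run the remaining denoising steps conditioned on the prompt.
    for t in scheduler.timesteps_from(t_start):
        latents = latent_vdm.denoise_step(latents, t, prompt)

    # 5) Decode back to pixels.
    return vae.decode(latents)
```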
👥 Authors
David Junhao Zhang
Show Lab, National University of Singapore
Jay Zhangjie Wu
Show Lab, National University of Singapore
Jia-Wei Liu
Show Lab, National University of Singapore
Rui Zhao
Show Lab, National University of Singapore
L. Ran
Show Lab, National University of Singapore
Yuchao Gu
National University of Singapore
Generative Models, Visual Generation, Multi-Modal Generation
Difei Gao
National University of Singapore; Institute of Computing Technology, Chinese Academy of Sciences
Artificial Intelligence, AI Agent, Vision and Language
Mike Zheng Shou
Show Lab, National University of Singapore