🤖 AI Summary
Existing text-to-video generation methods face a fundamental trade-off: pixel-level diffusion models incur prohibitive computational costs (72 GB GPU memory), whereas latent diffusion models struggle to ensure precise text-video alignment. This paper introduces Show-1—the first synergistic framework unifying pixel-space and latent-space video diffusion models (VDMs). It first generates low-resolution videos with strong semantic alignment using a pixel-space VDM, then employs a novel “expert translation mechanism” to drive a latent-space VDM for high-fidelity upsampling and detail refinement. Key contributions include: (1) a dual-domain (pixel + latent) collaborative architecture; (2) an expert translation upsampling paradigm bridging heterogeneous representation spaces; and (3) motion customization and style transfer achievable via fine-tuning only the temporal attention layers. Show-1 achieves state-of-the-art performance on standard benchmarks while reducing inference memory consumption to 15 GB—effectively balancing high visual fidelity and accurate text-video alignment.
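The two-stage pipeline above can be sketched in code. Everything below is illustrative pseudocode with hypothetical names (`pixel_vdm_generate`, `latent_vdm_upsample`, `show1_pipeline` are not the actual Show-1 API): the random array stands in for the pixel-space VDM's low-resolution output, and nearest-neighbour repetition stands in for the latent-space VDM's expert-translation upsampling, which in the real model denoises in latent space conditioned on the low-resolution video.

```python
import numpy as np

def pixel_vdm_generate(prompt: str, frames: int = 8, h: int = 64, w: int = 40) -> np.ndarray:
    """Stand-in for the pixel-space VDM: returns a low-res RGB video
    (T, H, W, C) with values in [0, 1]. A fixed seed replaces real sampling."""
    rng = np.random.default_rng(0)
    return rng.random((frames, h, w, 3))

def latent_vdm_upsample(video: np.ndarray, scale: int = 4) -> np.ndarray:
    """Stand-in for the latent-space VDM's expert translation: here a simple
    nearest-neighbour upsample; the real model adds high-frequency detail
    and removes artifacts while preserving the low-res video's semantics."""
    return video.repeat(scale, axis=1).repeat(scale, axis=2)

def show1_pipeline(prompt: str) -> tuple[np.ndarray, np.ndarray]:
    low_res = pixel_vdm_generate(prompt)    # stage 1: cheap, text-aligned
    high_res = latent_vdm_upsample(low_res) # stage 2: detail refinement
    return low_res, high_res

low, high = show1_pipeline("a panda eating bamboo")
print(low.shape, high.shape)  # (8, 64, 40, 3) (8, 256, 160, 3)
```

The key design point this sketch mirrors is the division of labour: the expensive, well-aligned pixel-space model only ever runs at low resolution, which is what keeps inference memory low.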
📝 Abstract
Significant advancements have been achieved in the realm of large-scale pre-trained text-to-video Diffusion Models (VDMs). However, previous methods either rely solely on pixel-based VDMs, which come with high computational costs, or on latent-based VDMs, which often struggle with precise text-video alignment. In this paper, we are the first to propose a hybrid model, dubbed Show-1, which marries pixel-based and latent-based VDMs for text-to-video generation. Our model first uses pixel-based VDMs to produce a low-resolution video with strong text-video correlation. We then propose a novel expert translation method that employs latent-based VDMs to upsample the low-resolution video to high resolution, which also removes potential artifacts and corruptions from the low-resolution video. Compared to latent VDMs, Show-1 produces high-quality videos with precise text-video alignment; compared to pixel VDMs, Show-1 is much more efficient (15 GB vs. 72 GB of GPU memory during inference). Furthermore, our Show-1 model can be readily adapted for motion customization and video stylization through simple finetuning of the temporal attention layers. Our model achieves state-of-the-art performance on standard video generation benchmarks. Our code and model weights are publicly available at https://github.com/showlab/Show-1.
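The abstract's last adaptation point, finetuning only the temporal attention layers for motion customization and stylization, corresponds to a standard parameter-freezing pattern. The sketch below is a hypothetical illustration (the parameter names and the `"temporal_attn"` substring are assumptions, not Show-1's actual module names): select the temporal-attention parameters as trainable and freeze everything else.

```python
def select_trainable(param_names: list[str], key: str = "temporal_attn") -> list[str]:
    """Return the parameter names left trainable when only temporal
    attention layers are finetuned; all other parameters stay frozen.
    In a PyTorch model this would set requires_grad accordingly."""
    return [name for name in param_names if key in name]

# Hypothetical UNet parameter names for illustration only.
names = [
    "unet.conv_in.weight",
    "unet.spatial_attn.to_q.weight",
    "unet.temporal_attn.to_q.weight",
    "unet.temporal_attn.to_out.weight",
]
print(select_trainable(names))
# ['unet.temporal_attn.to_q.weight', 'unet.temporal_attn.to_out.weight']
```

Restricting updates to the temporal layers keeps the pretrained spatial priors intact, which is why a small amount of finetuning suffices for motion customization without degrading per-frame image quality.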