🤖 AI Summary
Existing text-to-video generation methods face a fundamental trade-off: pixel-level diffusion models incur prohibitive computational costs (72 GB GPU memory), whereas latent diffusion models struggle to ensure precise text-video alignment. This paper introduces Show-1—the first synergistic framework unifying pixel-space and latent-space video diffusion models (VDMs). It first generates low-resolution videos with strong semantic alignment using a pixel-space VDM, then employs a novel “expert translation mechanism” to drive a latent-space VDM for high-fidelity upsampling and detail refinement. Key contributions include: (1) a dual-domain (pixel + latent) collaborative architecture; (2) an expert translation upsampling paradigm bridging heterogeneous representation spaces; and (3) motion customization and style transfer achievable via fine-tuning only the temporal attention layers. Show-1 achieves state-of-the-art performance on standard benchmarks while reducing inference memory consumption to 15 GB—effectively balancing high visual fidelity and accurate text-video alignment.
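The two-stage pipeline above can be sketched in code. Everything below is illustrative pseudocode with hypothetical names (`pixel_vdm_generate`, `latent_vdm_upsample`, `show1_pipeline` are not the actual Show-1 API): the random array stands in for the pixel-space VDM's low-resolution output, and nearest-neighbour repetition stands in for the latent-space VDM's expert-translation upsampling, which in the real model denoises in latent space conditioned on the low-resolution video.

```python
import numpy as np

def pixel_vdm_generate(prompt: str, frames: int = 8, h: int = 64, w: int = 40) -> np.ndarray:
    """Stand-in for the pixel-space VDM: returns a low-res RGB video
    (T, H, W, C) with values in [0, 1]. A fixed seed replaces real sampling."""
    rng = np.random.default_rng(0)
    return rng.random((frames, h, w, 3))

def latent_vdm_upsample(video: np.ndarray, scale: int = 4) -> np.ndarray:
    """Stand-in for the latent-space VDM's expert translation: here a simple
    nearest-neighbour upsample; the real model adds high-frequency detail
    and removes artifacts while preserving the low-res video's semantics."""
    return video.repeat(scale, axis=1).repeat(scale, axis=2)

def show1_pipeline(prompt: str) -> tuple[np.ndarray, np.ndarray]:
    low_res = pixel_vdm_generate(prompt)    # stage 1: cheap, text-aligned
    high_res = latent_vdm_upsample(low_res) # stage 2: detail refinement
    return low_res, high_res

low, high = show1_pipeline("a panda eating bamboo")
print(low.shape, high.shape)  # (8, 64, 40, 3) (8, 256, 160, 3)
```

The key design point this sketch mirrors is the division of labour: the expensive, well-aligned pixel-space model only ever runs at low resolution, which is what keeps inference memory low.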
📝 Abstract
Significant advancements have been achieved in the realm of large-scale pre-trained text-to-video Diffusion Models (VDMs). However, previous methods either rely solely on pixel-based VDMs, which come with high computational costs, or on latent-based VDMs, which often struggle with precise text-video alignment. In this paper, we are the first to propose a hybrid model, dubbed Show-1, which marries pixel-based and latent-based VDMs for text-to-video generation. Our model first uses pixel-based VDMs to produce a low-resolution video with strong text-video correlation. We then propose a novel expert translation method that employs latent-based VDMs to upsample the low-resolution video to high resolution, which also removes potential artifacts and corruptions from the low-resolution video. Compared to latent VDMs, Show-1 produces high-quality videos with precise text-video alignment; compared to pixel VDMs, Show-1 is much more efficient (15 GB vs. 72 GB of GPU memory during inference). Furthermore, our Show-1 model can be readily adapted for motion customization and video stylization through simple finetuning of the temporal attention layers. Our model achieves state-of-the-art performance on standard video generation benchmarks. Our code and model weights are publicly available at https://github.com/showlab/Show-1.
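The abstract's last adaptation point, finetuning only the temporal attention layers for motion customization and stylization, corresponds to a standard parameter-freezing pattern. The sketch below is a hypothetical illustration (the parameter names and the `"temporal_attn"` substring are assumptions, not Show-1's actual module names): select the temporal-attention parameters as trainable and freeze everything else.

```python
def select_trainable(param_names: list[str], key: str = "temporal_attn") -> list[str]:
    """Return the parameter names left trainable when only temporal
    attention layers are finetuned; all other parameters stay frozen.
    In a PyTorch model this would set requires_grad accordingly."""
    return [name for name in param_names if key in name]

# Hypothetical UNet parameter names for illustration only.
names = [
    "unet.conv_in.weight",
    "unet.spatial_attn.to_q.weight",
    "unet.temporal_attn.to_q.weight",
    "unet.temporal_attn.to_out.weight",
]
print(select_trainable(names))
# ['unet.temporal_attn.to_q.weight', 'unet.temporal_attn.to_out.weight']
```

Restricting updates to the temporal layers keeps the pretrained spatial priors intact, which is why a small amount of finetuning suffices for motion customization without degrading per-frame image quality.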