FSVideo: Fast Speed Video Diffusion Model in a Highly-Compressed Latent Space

📅 2026-02-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work proposes an efficient Transformer-based image-to-video generation framework to address the high computational cost and slow inference of existing diffusion-based video generation models. It operates in a highly compressed latent space (a 64×64×4 spatial-temporal downsampling ratio) and rests on three components: a high-compression video autoencoder, a diffusion Transformer (DiT) with a layer memory design that improves inter-layer information flow and context reuse, and a multi-resolution generation strategy built on a few-step DiT upsampler. The final model pairs a 14B-parameter DiT base model with a 14B-parameter DiT upsampler and achieves competitive quality against popular open-source models while being an order of magnitude faster.

📝 Abstract
We introduce FSVideo, a fast-speed transformer-based image-to-video (I2V) diffusion framework. We build our framework on the following key components: 1) a new video autoencoder with a highly compressed latent space ($64\times64\times4$ spatial-temporal downsampling ratio) that achieves competitive reconstruction quality; 2) a diffusion transformer (DiT) architecture with a new layer memory design to enhance inter-layer information flow and context reuse within the DiT; and 3) a multi-resolution generation strategy via a few-step DiT upsampler to increase video fidelity. Our final model, which contains a 14B DiT base model and a 14B DiT upsampler, achieves competitive performance against other popular open-source models while being an order of magnitude faster. We discuss our model design as well as training strategies in this report.
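To give a feel for what a $64\times64\times4$ downsampling ratio implies, the sketch below counts latent positions for one clip. It is illustrative arithmetic only and rests on assumptions not stated in the abstract: the ratio is read as 64× along height, 64× along width, and 4× along time; the 64-frame 1024×1024 clip is a made-up input; and the 8×8×4 autoencoder used for comparison is hypothetical. The function name `latent_grid` is likewise invented for this example, and the count ignores latent channels and any patchification inside the DiT.

```python
from math import prod


def latent_grid(frames, height, width, ratio=(4, 64, 64)):
    """Return the (T, H, W) latent grid for a clip under a (time, height, width) ratio.

    Assumption: each axis is downsampled independently by its ratio.
    """
    t_r, h_r, w_r = ratio
    # Ceiling division, so a partially filled block still gets a latent position.
    return (-(-frames // t_r), -(-height // h_r), -(-width // w_r))


# Hypothetical input clip: 64 frames at 1024x1024 pixels.
high_compression = latent_grid(64, 1024, 1024, ratio=(4, 64, 64))
# Hypothetical comparison autoencoder with an 8x8x4 ratio (not from the paper).
low_compression = latent_grid(64, 1024, 1024, ratio=(4, 8, 8))

print(high_compression, prod(high_compression))  # (16, 16, 16) 4096
print(low_compression, prod(low_compression))    # (16, 128, 128) 262144
print(prod(low_compression) // prod(high_compression))  # 64
```

Under these assumptions the highly compressed grid has 64 times fewer positions than the 8×8×4 grid. Since self-attention cost grows faster than linearly with sequence length, a reduction of this kind is one plausible source of the reported speedup, though the abstract does not break the speedup down by component.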
Problem

Research questions and friction points this paper is trying to address.

video diffusion model
fast generation
image-to-video
latent space compression
computational efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

video diffusion model
highly-compressed latent space
diffusion transformer
image-to-video generation
multi-resolution upsampling