🤖 AI Summary
This work proposes an efficient Transformer-based image-to-video generation framework to address the high computational cost and slow inference speed of existing diffusion-based video generation models. Operating in a highly compressed latent space (64×64×4), the approach introduces three key innovations: a high-compression video autoencoder, a diffusion Transformer (DiT) architecture enhanced with layer-wise memory mechanisms, and a multi-resolution few-step upsampling strategy. The resulting 14-billion-parameter base model, combined with the proposed upsampler, achieves high-quality video synthesis while accelerating inference by an order of magnitude compared to prevailing open-source models.
📝 Abstract
We introduce FSVideo, a fast Transformer-based image-to-video (I2V) diffusion framework. We build our framework on three key components: 1) a new video autoencoder with a highly compressed latent space ($64\times64\times4$ spatio-temporal downsampling ratio) that achieves competitive reconstruction quality; 2) a diffusion Transformer (DiT) architecture with a new layer-memory design that enhances inter-layer information flow and context reuse within the DiT; and 3) a multi-resolution generation strategy via a few-step DiT upsampler that increases video fidelity. Our final model, comprising a 14B DiT base model and a 14B DiT upsampler, achieves competitive performance against other popular open-source models while being an order of magnitude faster. We discuss our model design and training strategies in this report.
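To make the $64\times64\times4$ compression ratio concrete, here is a minimal sketch of how such a ratio shrinks a video tensor. It assumes the ratio means 64× along each spatial dimension and 4× along time; the latent channel count is a placeholder, as the report does not state it here.

```python
def latent_shape(frames, height, width, t_ratio=4, s_ratio=64, latent_channels=16):
    """Latent grid size for a video under a given spatio-temporal
    downsampling ratio. latent_channels is a hypothetical value."""
    return (frames // t_ratio, latent_channels, height // s_ratio, width // s_ratio)

# A 128-frame 720p clip (720x1280) maps to a far smaller latent grid:
print(latent_shape(128, 720, 1280))  # (32, 16, 11, 20)
```

Under this reading, the number of spatio-temporal positions the DiT must attend over drops by a factor of $64\times64\times4 = 16{,}384$, which is the main source of the claimed inference speedup.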