Open-Sora: Democratizing Efficient Video Production for All

πŸ“… 2024-12-29
πŸ“ˆ Citations: 68
✨ Influential: 15
πŸ€– AI Summary
Current AI vision research lags in high-fidelity, long-duration video generation, with a lack of open-source, high-quality, and controllable generative tools. To address this, we introduce Open-Soraβ€”the first fully open-source text- and image-to-video diffusion model capable of generating videos up to 15 seconds long at 720p resolution and arbitrary aspect ratios. Our method centers on three key innovations: (1) the Spatio-Temporal Decoupled Diffusion Transformer (STDiT), a novel architecture enabling efficient modeling of long-range spatio-temporal dependencies; (2) a high-compression-ratio 3D autoencoder paired with a customized training strategy, substantially improving reconstruction fidelity and generalization; and (3) integrated techniques including spatio-temporal separated attention, inference acceleration, and lightweight fine-tuning. All code, pretrained weights, and data processing scripts are publicly released, significantly lowering barriers for research and practical deployment of video AIGC.
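The core idea behind STDiT's decoupled attention can be illustrated with a minimal sketch: instead of running full attention over all T·S video tokens at once (cost O((TS)²)), each block first lets the S tokens within a frame attend to each other, then lets each spatial location attend across the T frames (cost O(TS² + ST²)). The function names and the single-head, projection-free attention below are simplifications for illustration, not the paper's actual implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x):
    # x: (..., n, d) — plain scaled dot-product self-attention over axis -2
    # (no learned Q/K/V projections, kept minimal on purpose)
    d = x.shape[-1]
    scores = x @ np.swapaxes(x, -1, -2) / np.sqrt(d)
    return softmax(scores) @ x

def stdit_block(x):
    # x: (T, S, d) — T frames, each with S spatial tokens of width d
    # Spatial attention: tokens within a frame attend to each other
    # (batched over the T frames).
    x = x + self_attention(x)
    # Temporal attention: each spatial location attends across frames
    # (batched over the S locations after a transpose).
    xt = np.swapaxes(x, 0, 1)          # (S, T, d)
    xt = xt + self_attention(xt)
    return np.swapaxes(xt, 0, 1)       # back to (T, S, d)
```

A real STDiT block would add learned projections, multiple heads, normalization, and the diffusion-timestep conditioning; the sketch only shows the spatial/temporal factorization that makes long videos tractable.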

πŸ“ Abstract
Vision and language are the two foundational senses for humans, and together they build up our cognitive ability and intelligence. While significant breakthroughs have been made in AI language ability, artificial visual intelligence, especially the ability to generate and simulate the world we see, lags far behind. To facilitate the development and accessibility of artificial visual intelligence, we created Open-Sora, an open-source video generation model designed to produce high-fidelity video content. Open-Sora supports a wide spectrum of visual generation tasks, including text-to-image generation, text-to-video generation, and image-to-video generation. The model leverages advanced deep learning architectures and training/inference techniques to enable flexible video synthesis, generating videos of up to 15 seconds, up to 720p resolution, and arbitrary aspect ratios. Specifically, we introduce the Spatial-Temporal Diffusion Transformer (STDiT), an efficient diffusion framework for videos that decouples spatial and temporal attention. We also introduce a highly compressive 3D autoencoder to make representations compact, and further accelerate training with a tailored training strategy. Through this initiative, we aim to foster innovation, creativity, and inclusivity within the community of AI content creation. By embracing the open-source principle, Open-Sora democratizes full access to all training/inference/data-preparation code as well as model weights. All resources are publicly available at: https://github.com/hpcaitech/Open-Sora.
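To see why a highly compressive 3D autoencoder matters, it helps to work out how much smaller the latent tensor is than the raw video. The downsampling factors below (4x temporal, 8x spatial, 16 latent channels) are illustrative assumptions for the arithmetic, not figures stated in this abstract.

```python
import math

def latent_shape(frames, height, width, c_lat=16, t_down=4, s_down=8):
    """Latent tensor shape after a 3D autoencoder.

    The factors (c_lat=16 channels, 4x temporal and 8x spatial
    downsampling) are hypothetical defaults for illustration.
    """
    return (c_lat, math.ceil(frames / t_down),
            height // s_down, width // s_down)

def compression_ratio(frames, height, width, **kw):
    """Raw RGB voxels divided by latent elements."""
    c, t, h, w = latent_shape(frames, height, width, **kw)
    return (3 * frames * height * width) / (c * t * h * w)

# Example: a 32-frame 720p clip
# raw:    3 x 32 x 720 x 1280  = 88,473,600 values
# latent: 16 x 8 x 90 x 160    =  1,843,200 values  → 48x compression
```

Every diffusion step then runs on the latent tensor instead of raw pixels, which is where the training and inference savings come from; higher compression ratios shrink this tensor further but make faithful reconstruction harder, hence the customized training strategy mentioned in the summary.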
Problem

Research questions and friction points this paper is trying to address.

AI visual content generation
high-quality video generation
simulating the visual world with AI
Innovation

Methods, ideas, or system contributions that make the work stand out.

spatial-temporal attention decoupling (STDiT)
high-compression 3D autoencoder
open-source video generation