MotionAura: Generating High-Quality and Motion Consistent Videos using Discrete Diffusion

๐Ÿ“… 2024-10-10
๐Ÿ›๏ธ arXiv.org
๐Ÿ“ˆ Citations: 0
โœจ Influential: 0
๐Ÿค– AI Summary
Video generation faces fundamental challenges in motion discontinuity and spatiotemporal modeling. To address these, the paper proposes an end-to-end discrete diffusion framework for high-fidelity, temporally coherent text-to-video synthesis. Methodologically: (1) it designs a 3D-MBQ-VAE latent-space compressor that integrates mobile inverted vector quantization into 3D video representation; (2) it introduces a spectral transformer denoiser operating in the Fourier domain to capture long-range spatiotemporal dependencies; (3) it adopts full-frame masking during training and LoRA-based parameter-efficient fine-tuning, enabling fine-grained controllable editing, including sketch-guided inpainting. The approach achieves state-of-the-art performance across multiple video generation and reconstruction benchmarks, improving motion consistency and visual detail fidelity. The authors state that code, pretrained models, and datasets will be released publicly.
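The Fourier-domain denoiser described above mixes information across all tokens at once rather than through local attention windows. As an illustration of the general idea (not the paper's actual architecture), an FNet-style mixing layer replaces attention with a 2D FFT over the sequence and feature axes and keeps the real part; the function name below is hypothetical:

```python
import numpy as np

def fourier_mixing(x):
    """Globally mix tokens by taking the real part of a 2D FFT over the
    sequence and feature axes (FNet-style sketch; illustrative only)."""
    return np.fft.fft2(x, axes=(-2, -1)).real

# Example: a batch of 4 sequences, 16 tokens, 32 features.
x = np.random.randn(4, 16, 32)
y = fourier_mixing(x)
assert y.shape == x.shape  # shape-preserving, like an attention block
```

Because the FFT touches every position, each output element depends on the whole sequence, which is one way to obtain the global context the summary attributes to the spectral transformer.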

๐Ÿ“ Abstract
The spatio-temporal complexity of video data presents significant challenges in tasks such as compression, generation, and inpainting. We present four key contributions to address the challenges of spatiotemporal video processing. First, we introduce the 3D Mobile Inverted Vector-Quantization Variational Autoencoder (3D-MBQ-VAE), which combines Variational Autoencoders (VAEs) with masked token modeling to enhance spatiotemporal video compression. The model achieves superior temporal consistency and state-of-the-art (SOTA) reconstruction quality by employing a novel training strategy with full frame masking. Second, we present MotionAura, a text-to-video generation framework that utilizes vector-quantized diffusion models to discretize the latent space and capture complex motion dynamics, producing temporally coherent videos aligned with text prompts. Third, we propose a spectral transformer-based denoising network that processes video data in the frequency domain using the Fourier Transform. This method effectively captures global context and long-range dependencies for high-quality video generation and denoising. Lastly, we introduce a downstream task of Sketch Guided Video Inpainting. This task leverages Low-Rank Adaptation (LoRA) for parameter-efficient fine-tuning. Our models achieve SOTA performance on a range of benchmarks. Our work offers robust frameworks for spatiotemporal modeling and user-driven video content manipulation. We will release the code, datasets, and models in open-source.
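The abstract's second contribution rests on discretizing the latent space with vector quantization: each continuous latent vector is replaced by its nearest entry in a learned codebook, so the diffusion model operates over discrete tokens. A minimal sketch of that lookup step (the codebook values here are made up for illustration):

```python
import numpy as np

def quantize(z, codebook):
    """Snap each latent vector in z (n, d) to its nearest codebook
    entry (K, d) under squared L2 distance; returns (quantized, indices)."""
    d2 = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)  # (n, K)
    idx = d2.argmin(axis=1)
    return codebook[idx], idx

# Toy example: 2-entry codebook in 2-D latent space.
codebook = np.array([[0.0, 0.0], [1.0, 1.0]])
z = np.array([[0.1, 0.1], [0.9, 1.2]])
zq, idx = quantize(z, codebook)  # idx -> [0, 1]
```

In a full VQ-VAE the codebook is trained jointly with the encoder and decoder, and a straight-through estimator passes gradients around the non-differentiable argmin; this sketch shows only the inference-time lookup.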
Problem

Research questions and friction points this paper is trying to address.

Enhance spatiotemporal video compression using 3D-MBQ-VAE.
Generate motion-consistent videos from text with MotionAura.
Improve video denoising via spectral transformer-based networks.
Innovation

Methods, ideas, or system contributions that make the work stand out.

3D-MBQ-VAE combines a VAE with masked token modeling for video compression.
MotionAura generates text-aligned video with vector-quantized diffusion.
A spectral transformer denoises video in the frequency domain.
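The inpainting task above is fine-tuned with LoRA, which freezes the pretrained weight matrix and learns only a low-rank additive update. A minimal sketch of the idea, assuming a plain linear layer (class and parameter names are illustrative, not from the paper):

```python
import numpy as np

class LoRALinear:
    """y = x @ (W + (alpha/rank) * B @ A)^T, with W frozen.
    Only A and B are trained; B starts at zero so the layer initially
    behaves exactly like the pretrained one."""
    def __init__(self, W, rank=4, alpha=4.0, seed=0):
        rng = np.random.default_rng(seed)
        self.W = W                                        # frozen (out, in)
        self.A = rng.standard_normal((rank, W.shape[1])) * 0.01
        self.B = np.zeros((W.shape[0], rank))             # zero-init update
        self.scale = alpha / rank

    def __call__(self, x):
        return x @ (self.W + self.scale * self.B @ self.A).T

# At initialization the LoRA layer matches the frozen base layer.
W = np.random.randn(8, 8)
layer = LoRALinear(W)
x = np.random.randn(3, 8)
```

Since only A and B (rank * (in + out) values) are updated, fine-tuning touches a small fraction of the parameters, which is what makes the downstream sketch-guided inpainting adaptation cheap.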