MotionAura: Generating High-Quality and Motion Consistent Videos using Discrete Diffusion

๐Ÿ“… 2024-10-10
๐Ÿ›๏ธ arXiv.org
๐Ÿ“ˆ Citations: 0
โœจ Influential: 0
๐Ÿค– AI Summary
Video generation faces fundamental challenges in motion discontinuity and spatiotemporal modeling. To address these, the paper proposes an end-to-end discrete diffusion framework for high-fidelity, temporally coherent text-to-video synthesis. Methodologically: (1) it designs a 3D-MBQ-VAE latent-space compressor that integrates mobile inverted vector quantization into 3D video representation; (2) it introduces a spectral transformer denoiser operating in the Fourier domain to capture long-range spatiotemporal dependencies; (3) it adopts full-frame masking during training and LoRA-based parameter-efficient fine-tuning, enabling fine-grained controllable editing, including sketch-guided inpainting. The approach achieves state-of-the-art performance across multiple video generation and reconstruction benchmarks, improving motion consistency and visual detail fidelity. The authors state that code, pretrained models, and datasets will be released publicly.
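The Fourier-domain denoiser described above mixes information across all tokens at once rather than through local attention windows. As an illustration of the general idea (not the paper's actual architecture), an FNet-style mixing layer replaces attention with a 2D FFT over the sequence and feature axes and keeps the real part; the function name below is hypothetical:

```python
import numpy as np

def fourier_mixing(x):
    """Globally mix tokens by taking the real part of a 2D FFT over the
    sequence and feature axes (FNet-style sketch; illustrative only)."""
    return np.fft.fft2(x, axes=(-2, -1)).real

# Example: a batch of 4 sequences, 16 tokens, 32 features.
x = np.random.randn(4, 16, 32)
y = fourier_mixing(x)
assert y.shape == x.shape  # shape-preserving, like an attention block
```

Because the FFT touches every position, each output element depends on the whole sequence, which is one way to obtain the global context the summary attributes to the spectral transformer.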

๐Ÿ“ Abstract
The spatio-temporal complexity of video data presents significant challenges in tasks such as compression, generation, and inpainting. We present four key contributions to address the challenges of spatiotemporal video processing. First, we introduce the 3D Mobile Inverted Vector-Quantization Variational Autoencoder (3D-MBQ-VAE), which combines Variational Autoencoders (VAEs) with masked token modeling to enhance spatiotemporal video compression. The model achieves superior temporal consistency and state-of-the-art (SOTA) reconstruction quality by employing a novel training strategy with full frame masking. Second, we present MotionAura, a text-to-video generation framework that utilizes vector-quantized diffusion models to discretize the latent space and capture complex motion dynamics, producing temporally coherent videos aligned with text prompts. Third, we propose a spectral transformer-based denoising network that processes video data in the frequency domain using the Fourier Transform. This method effectively captures global context and long-range dependencies for high-quality video generation and denoising. Lastly, we introduce a downstream task of Sketch Guided Video Inpainting. This task leverages Low-Rank Adaptation (LoRA) for parameter-efficient fine-tuning. Our models achieve SOTA performance on a range of benchmarks. Our work offers robust frameworks for spatiotemporal modeling and user-driven video content manipulation. We will release the code, datasets, and models in open-source.
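The abstract's second contribution rests on discretizing the latent space with vector quantization: each continuous latent vector is replaced by its nearest entry in a learned codebook, so the diffusion model operates over discrete tokens. A minimal sketch of that lookup step (the codebook values here are made up for illustration):

```python
import numpy as np

def quantize(z, codebook):
    """Snap each latent vector in z (n, d) to its nearest codebook
    entry (K, d) under squared L2 distance; returns (quantized, indices)."""
    d2 = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)  # (n, K)
    idx = d2.argmin(axis=1)
    return codebook[idx], idx

# Toy example: 2-entry codebook in 2-D latent space.
codebook = np.array([[0.0, 0.0], [1.0, 1.0]])
z = np.array([[0.1, 0.1], [0.9, 1.2]])
zq, idx = quantize(z, codebook)  # idx -> [0, 1]
```

In a full VQ-VAE the codebook is trained jointly with the encoder and decoder, and a straight-through estimator passes gradients around the non-differentiable argmin; this sketch shows only the inference-time lookup.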
Problem

Research questions and friction points this paper is trying to address.

Enhance spatiotemporal video compression using 3D-MBQ-VAE.
Generate motion-consistent videos from text with MotionAura.
Improve video denoising via spectral transformer-based networks.
Innovation

Methods, ideas, or system contributions that make the work stand out.

3D-MBQ-VAE combines a VAE with masked token modeling for video compression.
MotionAura generates text-aligned video with vector-quantized diffusion.
A spectral transformer denoises video in the frequency domain.
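The inpainting task above is fine-tuned with LoRA, which freezes the pretrained weight matrix and learns only a low-rank additive update. A minimal sketch of the idea, assuming a plain linear layer (class and parameter names are illustrative, not from the paper):

```python
import numpy as np

class LoRALinear:
    """y = x @ (W + (alpha/rank) * B @ A)^T, with W frozen.
    Only A and B are trained; B starts at zero so the layer initially
    behaves exactly like the pretrained one."""
    def __init__(self, W, rank=4, alpha=4.0, seed=0):
        rng = np.random.default_rng(seed)
        self.W = W                                        # frozen (out, in)
        self.A = rng.standard_normal((rank, W.shape[1])) * 0.01
        self.B = np.zeros((W.shape[0], rank))             # zero-init update
        self.scale = alpha / rank

    def __call__(self, x):
        return x @ (self.W + self.scale * self.B @ self.A).T

# At initialization the LoRA layer matches the frozen base layer.
W = np.random.randn(8, 8)
layer = LoRALinear(W)
x = np.random.randn(3, 8)
```

Since only A and B (rank * (in + out) values) are updated, fine-tuning touches a small fraction of the parameters, which is what makes the downstream sketch-guided inpainting adaptation cheap.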