ARLON: Boosting Diffusion Transformers with Autoregressive Models for Long Video Generation

📅 2024-10-27
🏛️ arXiv.org
📈 Citations: 3
Influential: 0

🤖 AI Summary
To address weak motion dynamics and temporal inconsistency in long-video generation, this paper proposes ARLON, a framework in which an autoregressive (AR) model guides a diffusion Transformer (DiT). The AR model captures coarse spatial and long-range temporal information, which then steers high-fidelity long-video synthesis by the DiT. Key innovations include: (i) a latent VQ-VAE that compresses the DiT's input latent space into compact visual tokens, bridging the AR and DiT models; (ii) an adaptive norm-based semantic injection module that feeds the AR model's coarse discrete visual units into the DiT; and (iii) training the DiT with coarser visual latent tokens and an uncertainty sampling module so that it tolerates noise from AR inference. ARLON outperforms the OpenSora-V1.2 baseline on 8 of 11 metrics selected from VBench, with notable gains in dynamic degree and aesthetic quality, and achieves state-of-the-art performance in long video generation. It also accelerates generation and supports long-video generation from progressive text prompts.
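
The "visual token" idea in (i) rests on vector quantization: each continuous latent vector is replaced by the index of its nearest entry in a learned codebook. The following is a minimal NumPy sketch of that lookup step only, not ARLON's implementation; the function name `quantize`, the toy codebook, and the 2-D latents are all illustrative assumptions.

```python
import numpy as np

def quantize(latents, codebook):
    """Map each latent vector to the index of its nearest codebook entry.

    latents:  (n, d) array of continuous latent vectors
    codebook: (k, d) array of code vectors (hypothetical toy values here)
    Returns (indices, quantized) with shapes (n,) and (n, d).
    """
    # Squared Euclidean distance between every latent and every code: (n, k)
    diffs = latents[:, None, :] - codebook[None, :, :]
    dists = np.sum(diffs ** 2, axis=-1)
    # The discrete token is the index of the closest code
    indices = np.argmin(dists, axis=1)
    return indices, codebook[indices]

# Toy example: three 2-D codes and three latents
codebook = np.array([[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]])
latents = np.array([[0.9, 1.2], [-0.1, 0.05], [-0.8, 0.7]])
indices, quantized = quantize(latents, codebook)
print(indices.tolist())  # [1, 0, 2]
```

A sequence model can then work on the integer `indices` instead of the continuous latents, which is roughly the role the summary gives to the compact tokens shared by the AR and DiT models.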

📝 Abstract
Text-to-video models have recently undergone rapid and substantial advancements. Nevertheless, due to limitations in data and computational resources, achieving efficient generation of long videos with rich motion dynamics remains a significant challenge. To generate high-quality, dynamic, and temporally consistent long videos, this paper presents ARLON, a novel framework that boosts diffusion Transformers with autoregressive models for long video generation, by integrating the coarse spatial and long-range temporal information provided by the AR model to guide the DiT model. Specifically, ARLON incorporates several key innovations: 1) A latent Vector Quantized Variational Autoencoder (VQ-VAE) compresses the input latent space of the DiT model into compact visual tokens, bridging the AR and DiT models and balancing the learning complexity and information density; 2) An adaptive norm-based semantic injection module integrates the coarse discrete visual units from the AR model into the DiT model, ensuring effective guidance during video generation; 3) To enhance the tolerance capability of noise introduced from the AR inference, the DiT model is trained with coarser visual latent tokens incorporated with an uncertainty sampling module. Experimental results demonstrate that ARLON significantly outperforms the baseline OpenSora-V1.2 on eight out of eleven metrics selected from VBench, with notable improvements in dynamic degree and aesthetic quality, while delivering competitive results on the remaining three and simultaneously accelerating the generation process. In addition, ARLON achieves state-of-the-art performance in long video generation. Detailed analyses of the improvements in inference efficiency are presented, alongside a practical application that demonstrates the generation of long videos using progressive text prompts. See demos of ARLON at http://aka.ms/arlon.
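
The abstract does not spell out how the adaptive norm-based injection module is built. As a rough illustration of the general adaptive-normalization pattern, the sketch below normalizes hidden features and then modulates them with a scale and shift computed from a conditioning signal. The function `adaptive_norm`, the linear projections `w_scale` and `w_shift`, and all shapes are assumptions made for this example, not details taken from the paper.

```python
import numpy as np

def adaptive_norm(hidden, condition, w_scale, w_shift, eps=1e-6):
    """Layer-normalize `hidden`, then modulate it with condition-derived
    scale and shift.

    hidden:    (tokens, d) features inside the generator
    condition: (tokens, c) conditioning features (e.g. coarse visual units)
    w_scale, w_shift: (c, d) projection matrices (hypothetical parameters)
    """
    # Per-token normalization over the feature dimension
    mean = hidden.mean(axis=-1, keepdims=True)
    var = hidden.var(axis=-1, keepdims=True)
    normed = (hidden - mean) / np.sqrt(var + eps)
    # Scale and shift are predicted from the condition
    scale = condition @ w_scale
    shift = condition @ w_shift
    return normed * (1.0 + scale) + shift

rng = np.random.default_rng(0)
hidden = rng.normal(size=(4, 8))
condition = rng.normal(size=(4, 3))

# With zero projections the module reduces to plain layer normalization
w_zero = np.zeros((3, 8))
plain = adaptive_norm(hidden, condition, w_zero, w_zero)

# With non-zero projections the condition changes the output
w_scale = 0.1 * rng.normal(size=(3, 8))
w_shift = 0.1 * rng.normal(size=(3, 8))
modulated = adaptive_norm(hidden, condition, w_scale, w_shift)
```

Writing the scale as `1.0 + scale` means that zero-initialized projections leave the normalized features unchanged, so in this sketch the conditioning can be introduced gradually during training.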

Problem

Research questions and friction points this paper is trying to address.

Enhance long video generation quality
Integrate autoregressive and diffusion models
Improve motion dynamics and temporal consistency in videos

Innovation

Methods, ideas, or system contributions that make the work stand out.

VQ-VAE compresses DiT input latent space
Adaptive norm-based semantic injection module
DiT trained with coarser visual tokens and an uncertainty sampling module to tolerate AR inference noise
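
This page does not describe how the uncertainty sampling module works. One generic way to make a model tolerant of imperfect conditioning is to perturb the conditioning features with Gaussian noise during training. The sketch below shows only that generic augmentation as a stand-in; `perturb_condition`, `noise_level`, and the per-channel scaling are assumptions for illustration and should not be read as ARLON's module.

```python
import numpy as np

def perturb_condition(cond, noise_level, rng):
    """Add Gaussian noise to conditioning features, scaled by the standard
    deviation of each feature channel.

    cond:        (tokens, d) clean conditioning features
    noise_level: non-negative float; 0.0 returns the input unchanged
    rng:         numpy.random.Generator used to draw the noise
    """
    # Per-channel spread, so noise is proportional to each channel's scale
    std = cond.std(axis=0, keepdims=True)
    noise = rng.normal(size=cond.shape)
    return cond + noise_level * std * noise

rng = np.random.default_rng(0)
cond = rng.normal(size=(16, 4))

# Training-time view: the generator would see the perturbed features
noisy = perturb_condition(cond, noise_level=0.1, rng=rng)

# Sanity check: zero noise level leaves the features untouched
clean = perturb_condition(cond, noise_level=0.0, rng=rng)
```

The intent is that a generator trained on `noisy` rather than `cond` depends less on the conditioning being exact, which matters when that conditioning comes from an AR model's predictions at inference time.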