Recurrent Autoregressive Diffusion: Global Memory Meets Local Attention

📅 2025-11-16

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing video generation models forget historical information and produce spatiotemporal inconsistencies in long-video synthesis, primarily because they lack efficient memory compression and retrieval mechanisms. To address this, we propose RAD, a Recurrent Autoregressive Diffusion framework that integrates an LSTM into a diffusion Transformer to enable frame-level autoregressive memory updating and retrieval, handling memory the same way in training and inference. RAD further couples a global memory store with local attention, balancing long-range dependency modeling and computational efficiency. On the Memory Maze and Minecraft benchmarks, RAD improves the temporal coherence and visual fidelity of generated videos. Our ablation studies confirm the effectiveness of LSTM-enhanced memory in mitigating forgetting and enhancing consistency. RAD offers a scalable approach to long-video generation.

📝 Abstract

Recent advancements in video generation have demonstrated the potential of using video diffusion models as world models, with autoregressive generation of infinitely long videos through masked conditioning. However, such models, usually with local full attention, lack effective memory compression and retrieval for long-term generation beyond the window size, leading to issues of forgetting and spatiotemporal inconsistencies. To enhance the retention of historical information within a fixed memory budget, we introduce a recurrent neural network (RNN) into the diffusion transformer framework. Specifically, a diffusion model incorporating an LSTM with attention achieves performance comparable to state-of-the-art RNN blocks, such as TTT and Mamba2. Moreover, existing diffusion-RNN approaches often suffer from performance degradation due to a training-inference gap or the lack of overlap across windows. To address these limitations, we propose a novel Recurrent Autoregressive Diffusion (RAD) framework, which executes frame-wise autoregression for memory update and retrieval consistently across training and inference. Experiments on the Memory Maze and Minecraft datasets demonstrate the superiority of RAD for long video generation, highlighting the efficiency of LSTM in sequence modeling.

Problem

Research questions and friction points this paper is trying to address.

Video diffusion models lack memory compression for long-term generation

Existing diffusion-RNN approaches suffer from a training-inference gap or a lack of overlap across windows

Current methods struggle with forgetting and spatiotemporal inconsistencies
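
The forgetting problem listed above can be illustrated with a toy sliding window. This is a minimal sketch, not the paper's code: the function name `visible_frames` and the window size are hypothetical, and frames are represented only by their indices. It shows that once generation moves past the window size, early frames are no longer reachable by local attention alone.

```python
from collections import deque


def visible_frames(num_frames, window_size):
    """For each generation step, list the frame indices a local window can see."""
    window = deque(maxlen=window_size)  # the oldest frame is dropped once full
    history = []
    for t in range(num_frames):
        window.append(t)
        history.append(list(window))
    return history


steps = visible_frames(num_frames=6, window_size=3)
print(steps[-1])       # [3, 4, 5]
print(0 in steps[-1])  # False: frame 0 has left the window
```

Anything the model needs from frame 0 at the last step has to come from some other memory mechanism, which is the gap the recurrent component is meant to fill.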

Innovation

Methods, ideas, or system contributions that make the work stand out.

Incorporates an LSTM with attention into the diffusion transformer

Proposes the Recurrent Autoregressive Diffusion (RAD) framework for consistency between training and inference

Uses frame-wise autoregression for memory update and retrieval
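
The ideas above can be sketched with a toy frame-wise loop that keeps a short local window next to a fixed-size recurrent state. This is only an illustration under strong assumptions: each frame is reduced to a single scalar feature, the weights are arbitrary hypothetical values, and the diffusion transformer and denoising steps are omitted entirely. The names `lstm_step` and `rollout` are not from the paper; the gate equations are the standard LSTM cell.

```python
import math
from collections import deque


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x, h, c, w):
    """One standard LSTM cell update on a scalar input (hypothetical weights w)."""
    i = sigmoid(w["wi"] * x + w["ui"] * h + w["bi"])    # input gate
    f = sigmoid(w["wf"] * x + w["uf"] * h + w["bf"])    # forget gate
    o = sigmoid(w["wo"] * x + w["uo"] * h + w["bo"])    # output gate
    g = math.tanh(w["wg"] * x + w["ug"] * h + w["bg"])  # candidate memory
    c_new = f * c + i * g            # memory update
    h_new = o * math.tanh(c_new)     # memory readout (retrieval)
    return h_new, c_new


def rollout(frames, window_size, w):
    """Process frames one at a time; return (local window, memory readout) per step."""
    h, c = 0.0, 0.0
    window = deque(maxlen=window_size)
    outputs = []
    for x in frames:
        window.append(x)               # local context: last window_size frames
        h, c = lstm_step(x, h, c, w)   # global context: fixed-size recurrent state
        outputs.append((list(window), h))
    return outputs


# Arbitrary illustrative weights; a large forget-gate bias favors retention.
keys = ["wi", "ui", "bi", "wf", "uf", "bf", "wo", "uo", "bo", "wg", "ug", "bg"]
w = {k: 0.5 for k in keys}
w["bf"] = 2.0

# Two toy videos that differ only in their first frame.
out_a = rollout([1.0, 0.2, 0.2, 0.2], window_size=2, w=w)
out_b = rollout([-1.0, 0.2, 0.2, 0.2], window_size=2, w=w)

print(out_a[-1][0] == out_b[-1][0])  # True: the local windows are identical
print(out_a[-1][1] != out_b[-1][1])  # True: the memory readouts still differ
```

At the last step the two local windows are identical, yet the recurrent readouts differ, so information about the first frame survives within a constant memory budget. Because the state is updated one frame at a time, the same `rollout` loop could in principle be used unchanged for training and for generation, which is the kind of consistency the framework aims for.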

🔎 Similar Papers

Faster Diffusion via Temporal Attention Decomposition