A$^2$RD: Agentic Autoregressive Diffusion for Long Video Consistency

📅 2026-05-07

📈 Citations: 0

✨ Influential: 0

career value

223K/year

🤖 AI Summary

This work addresses the challenges of semantic drift and narrative collapse that undermine consistency and coherence in long-form video generation. To this end, the authors propose a closed-loop, segment-wise generation and self-optimization framework that decouples creative synthesis from consistency constraints through an iterative “retrieve–synthesize–refine–update” cycle. The approach integrates multimodal video memory, adaptive segmentation, hierarchical test-time self-optimization, and multi-generation mode switching. Evaluated on established benchmarks and a newly introduced LVBench-C dataset, the method achieves up to a 30% improvement in visual-temporal consistency and a 20% gain in narrative coherence, with human evaluations further confirming significant enhancements in motion smoothness and scene transition quality.

📝 Abstract

Synthesizing consistent and coherent long video remains a fundamental challenge. Existing methods suffer from semantic drift and narrative collapse over long horizons. We present A$^2$RD, an Agentic Auto-Regressive Diffusion architecture that decouples creative synthesis from consistency enforcement. A$^2$RD formulates long video synthesis as a closed-loop process that synthesizes and self-improves video segment-by-segment through a Retrieve--Synthesize--Refine--Update cycle. It comprises three core components: (i) Multimodal Video Memory that tracks video progression across modalities; (ii) Adaptive Segment Generation that switches among generation modes for natural progression and visual consistency; and (iii) Hierarchical Test-Time Self-Improvement that self-improves each segment at frame and video levels to prevent error propagation. We further introduce LVBench-C, a challenging benchmark with non-linear entity and environment transitions to stress-test long-horizon consistency. Across public and LVBench-C benchmarks spanning one- to ten-minute videos, A$^2$RD outperforms state-of-the-art baselines by up to 30% in consistency and 20% in narrative coherence. Human evaluations corroborate these gains while also highlighting notable improvements in motion and transition smoothness.

Problem

Research questions and friction points this paper is trying to address.

long video synthesis

semantic drift

narrative coherence

video consistency

temporal consistency

Innovation

Methods, ideas, or system contributions that make the work stand out.

Agentic Autoregressive Diffusion

Long Video Synthesis

Multimodal Video Memory