A$^2$RD: Agentic Autoregressive Diffusion for Long Video Consistency

📅 2026-05-07
📈 Citations: 0
Influential: 0
📄 PDF

career value

221K/year
🤖 AI Summary
This work addresses the challenges of semantic drift and narrative collapse that undermine consistency and coherence in long-form video generation. To this end, the authors propose a closed-loop, segment-wise generation and self-optimization framework that decouples creative synthesis from consistency constraints through an iterative “retrieve–synthesize–refine–update” cycle. The approach integrates multimodal video memory, adaptive segmentation, hierarchical test-time self-optimization, and multi-generation mode switching. Evaluated on established benchmarks and a newly introduced LVBench-C dataset, the method achieves up to a 30% improvement in visual-temporal consistency and a 20% gain in narrative coherence, with human evaluations further confirming significant enhancements in motion smoothness and scene transition quality.
📝 Abstract
Synthesizing consistent and coherent long video remains a fundamental challenge. Existing methods suffer from semantic drift and narrative collapse over long horizons. We present A$^2$RD, an Agentic Auto-Regressive Diffusion architecture that decouples creative synthesis from consistency enforcement. A$^2$RD formulates long video synthesis as a closed-loop process that synthesizes and self-improves video segment-by-segment through a Retrieve--Synthesize--Refine--Update cycle. It comprises three core components: (i) Multimodal Video Memory that tracks video progression across modalities; (ii) Adaptive Segment Generation that switches among generation modes for natural progression and visual consistency; and (iii) Hierarchical Test-Time Self-Improvement that self-improves each segment at frame and video levels to prevent error propagation. We further introduce LVBench-C, a challenging benchmark with non-linear entity and environment transitions to stress-test long-horizon consistency. Across public and LVBench-C benchmarks spanning one- to ten-minute videos, A$^2$RD outperforms state-of-the-art baselines by up to 30% in consistency and 20% in narrative coherence. Human evaluations corroborate these gains while also highlighting notable improvements in motion and transition smoothness.
Problem

Research questions and friction points this paper is trying to address.

long video synthesis
semantic drift
narrative coherence
video consistency
temporal consistency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Agentic Autoregressive Diffusion
Long Video Synthesis
Multimodal Video Memory
Self-Improvement
Consistency Enforcement