Macro-from-Micro Planning for High-Quality and Parallelized Autoregressive Long Video Generation

📅 2025-08-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing autoregressive diffusion models suffer from error accumulation, which causes temporal drift, and offer limited parallelization in long-video generation. This paper proposes Macro-from-Micro Planning (MMPL), a hierarchical planning framework: micro planning predicts sparse keyframes within each short segment; macro planning chains these micro plans autoregressively across segments to maintain long-term temporal consistency; and intermediate frames are then populated in parallel across segments. MMPL combines diffusion modeling, keyframe guidance, and adaptive workload scheduling, improving GPU load balancing and generation efficiency. Experiments demonstrate that MMPL outperforms state-of-the-art methods in long-duration video generation, achieving superior visual quality and enhanced temporal stability.

📝 Abstract
Current autoregressive diffusion models excel at video generation but are generally limited to short temporal durations. Our theoretical analysis indicates that autoregressive modeling typically suffers from temporal drift caused by error accumulation, which also hinders parallelization in long video synthesis. To address these limitations, we propose a novel planning-then-populating framework centered on Macro-from-Micro Planning (MMPL) for long video generation. MMPL sketches a global storyline for the entire video through two hierarchical stages: Micro Planning and Macro Planning. Specifically, Micro Planning predicts a sparse set of future keyframes within each short video segment, offering motion and appearance priors that guide high-quality segment generation. Macro Planning extends the in-segment keyframe planning across the entire video through an autoregressive chain of micro plans, ensuring long-term consistency across video segments. Subsequently, MMPL-based Content Populating generates all intermediate frames in parallel across segments, enabling efficient parallelization of autoregressive generation. Parallelization is further optimized by Adaptive Workload Scheduling for balanced GPU execution and accelerated autoregressive video generation. Extensive experiments confirm that our method outperforms existing long video generation models in quality and stability. Generated videos and comparison results are available on our project page.
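The planning-then-populating control flow described in the abstract can be sketched roughly as follows. This is a toy illustration only: every function here is a hypothetical stand-in (frames are modeled as numbers, not tensors), and none of it reflects the authors' actual models or API.

```python
# Hypothetical sketch of the MMPL planning-then-populating flow.
# Frames are toy integers/floats; the real method uses diffusion models.

def micro_plan(context, n_keyframes=3):
    """Predict a sparse set of future keyframes for one short segment."""
    last = context[-1]
    return [last + i + 1 for i in range(n_keyframes)]  # toy "prediction"

def macro_plan(first_frame, n_segments):
    """Chain micro plans autoregressively across segments."""
    plans, context = [], [first_frame]
    for _ in range(n_segments):
        keyframes = micro_plan(context)
        plans.append(keyframes)
        context = keyframes  # next segment conditions on previous keyframes
    return plans

def populate(keyframes):
    """Fill intermediate frames between consecutive keyframes."""
    frames = []
    for a, b in zip(keyframes, keyframes[1:]):
        frames += [a, (a + b) / 2]  # toy interpolation
    return frames + [keyframes[-1]]

plans = macro_plan(first_frame=0, n_segments=2)
# Given its plan, each segment is populated independently, so this loop
# is what MMPL parallelizes across segments/GPUs.
video = [f for p in plans for f in populate(p)]
```

The key structural point is that only the cheap keyframe planning is sequential; the expensive per-segment populating depends only on each segment's own keyframes and can therefore run in parallel.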
Problem

Research questions and friction points this paper is trying to address.

Addresses temporal drift in autoregressive long video generation
Enhances parallelization for efficient long video synthesis
Ensures long-term consistency across video segments
Innovation

Methods, ideas, or system contributions that make the work stand out.

Macro-from-Micro Planning for long videos
Hierarchical keyframe prediction for consistency
Parallelized generation with adaptive scheduling
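The paper's Adaptive Workload Scheduling is not detailed on this page, but the general idea of balancing per-segment generation costs across GPUs can be sketched with a standard greedy longest-processing-time assignment. The function name and cost values below are hypothetical, not from the paper.

```python
import heapq

def balance_segments(costs, n_gpus):
    """Greedy LPT scheduling: assign each segment (largest cost first)
    to the currently least-loaded GPU. `costs` are estimated per-segment
    generation costs (hypothetical units)."""
    heap = [(0.0, g) for g in range(n_gpus)]  # (load, gpu_id)
    heapq.heapify(heap)
    assignment = {g: [] for g in range(n_gpus)}
    for seg, cost in sorted(enumerate(costs), key=lambda x: -x[1]):
        load, g = heapq.heappop(heap)
        assignment[g].append(seg)
        heapq.heappush(heap, (load + cost, g))
    return assignment

plan = balance_segments([5, 3, 2, 2, 4], n_gpus=2)
```

Greedy LPT is a classic heuristic for makespan minimization; any scheduler with this interface would serve the same role of keeping GPU loads balanced during parallel segment populating.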
👥 Authors
Xunzhi Xiang, Nanjing University
Yabo Chen, Shanghai Jiaotong University (Self-supervised Learning)
Guiyu Zhang, The Chinese University of Hong Kong (Shenzhen) (Computer Vision, Pattern Recognition, Machine Learning)
Zhongyu Wang, TeleAI
Zhe Gao, Nanjing University
Quanming Xiang, University of Chinese Academy of Sciences
Gonghu Shang, TeleAI
Junqi Liu, TeleAI
Haibin Huang, Principal Research Scientist at TeleAI (Computer Graphics, Computer Vision, Geometric Modeling, 3D Deep Learning)
Yang Gao, Nanjing University
Chi Zhang, TeleAI
Qi Fan, Nanjing University
Xuelong Li, TeleAI