BAgger: Backwards Aggregation for Mitigating Drift in Autoregressive Video Diffusion Models

📅 2025-12-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
Autoregressive video diffusion models suffer from exposure bias: during training, they condition on ground-truth past frames, whereas during inference, they rely on self-generated frames, leading to error accumulation and long-horizon quality drift. This paper introduces the first self-supervised backwards aggregation framework, which constructs correction trajectories solely via model rollouts, requiring no external supervision, large teacher models, or multi-step distillation. Built on a causal diffusion Transformer architecture, the method is compatible with standard score matching or flow matching objectives, enables end-to-end training, and avoids costly long-range temporal backpropagation. Evaluated on text-to-video generation, video extrapolation, and multi-prompt synthesis, it significantly mitigates drift, stabilizes long-horizon motion modeling, and improves visual consistency while preserving generative diversity.

📝 Abstract
Autoregressive video models are promising for world modeling via next-frame prediction, but they suffer from exposure bias: a mismatch between training on clean contexts and inference on self-generated frames, causing errors to compound and quality to drift over time. We introduce Backwards Aggregation (BAgger), a self-supervised scheme that constructs corrective trajectories from the model's own rollouts, teaching it to recover from its mistakes. Unlike prior approaches that rely on few-step distillation and distribution-matching losses, which can hurt quality and diversity, BAgger trains with standard score or flow matching objectives, avoiding large teachers and long-chain backpropagation through time. We instantiate BAgger on causal diffusion transformers and evaluate on text-to-video, video extension, and multi-prompt generation, observing more stable long-horizon motion and better visual consistency with reduced drift.
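The training scheme the abstract describes — condition on the model's own rollout frames rather than ground-truth context, then optimize a standard flow matching loss without backpropagating through the rollout — can be sketched as a toy example. Everything below (scalar "frames", a hypothetical linear model, one-step Euler sampling, the function names) is an illustrative assumption, not the paper's actual implementation:

```python
import random

# Toy setup: a "frame" is a single float; a linear model predicts the
# flow-matching velocity from the noised frame, the time t, and the
# previous (context) frame. All names here are illustrative.

def model(w, x_t, t, ctx):
    # w: four weights of a hypothetical linear parameterization
    return w[0] * x_t + w[1] * t + w[2] * ctx + w[3]

def rollout(w, first_frame, n_frames):
    """Autoregressively generate frames with one-step Euler sampling.
    These self-generated frames serve as training context; no gradient
    flows through them, mirroring the claim of avoiding long-chain
    backpropagation through time."""
    frames = [first_frame]
    for _ in range(n_frames - 1):
        noise = random.gauss(0.0, 1.0)
        # one Euler step from t=0 to t=1: x1 ~ noise + v(noise, 0, ctx)
        frames.append(noise + model(w, noise, 0.0, frames[-1]))
    return frames

def flow_matching_loss(w, target_frame, ctx):
    """Standard conditional flow matching loss on one sample:
    interpolate noise -> target, regress the velocity (target - noise)."""
    noise = random.gauss(0.0, 1.0)
    t = random.random()
    x_t = (1.0 - t) * noise + t * target_frame  # linear interpolation path
    v_target = target_frame - noise             # conditional velocity target
    err = model(w, x_t, t, ctx) - v_target
    return err * err

# Corrective training pair: a ground-truth next frame is predicted from a
# self-generated context frame, so the model learns to recover from its
# own drifted states.
random.seed(0)
w = [0.1, 0.1, 0.1, 0.0]
ctx_frames = rollout(w, 1.0, 4)
loss = flow_matching_loss(w, target_frame=2.0, ctx=ctx_frames[-1])
```

In this sketch, the corrective signal comes entirely from pairing real target frames with self-generated contexts; a real system would use a diffusion transformer and tensor-valued frames, but the loss structure is the same.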
Problem

Research questions and friction points this paper is trying to address.

Mitigates exposure bias in autoregressive video models
Reduces error accumulation and quality drift over time
Improves long-horizon motion stability and visual consistency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Self-supervised Backwards Aggregation corrects errors from model rollouts
Uses standard score or flow matching to avoid distillation losses
Applied to causal diffusion transformers for stable video generation