🤖 AI Summary
Autoregressive video diffusion models suffer from exposure bias: during training they condition on ground-truth past frames, but during inference they rely on self-generated frames, leading to error accumulation and long-horizon quality drift. This paper introduces Backwards Aggregation (BAgger), the first self-supervised framework to construct correction trajectories solely from model rollouts, requiring no external supervision, large teacher models, or multi-step distillation. Built on a causal diffusion Transformer architecture, the method is compatible with standard score-matching or flow-matching objectives, trains end to end, and avoids costly long-range temporal backpropagation. Evaluated on text-to-video generation, video extrapolation, and multi-prompt synthesis, it significantly mitigates drift, stabilizes long-horizon motion modeling, and improves visual consistency while preserving generative diversity.
📝 Abstract
Autoregressive video models are promising for world modeling via next-frame prediction, but they suffer from exposure bias: a mismatch between training on clean contexts and inference on self-generated frames, causing errors to compound and quality to drift over time. We introduce Backwards Aggregation (BAgger), a self-supervised scheme that constructs corrective trajectories from the model's own rollouts, teaching it to recover from its mistakes. Unlike prior approaches that rely on few-step distillation and distribution-matching losses, which can hurt quality and diversity, BAgger trains with standard score or flow matching objectives, avoiding large teachers and long-chain backpropagation through time. We instantiate BAgger on causal diffusion transformers and evaluate on text-to-video, video extension, and multi-prompt generation, observing more stable long-horizon motion and better visual consistency with reduced drift.
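The core idea, training on the model's own rollouts so it learns to recover from drifted states, can be illustrated with a toy sketch. This is not code from the paper: a scalar linear next-frame model stands in for the causal diffusion transformer, a plain squared loss stands in for the score/flow-matching objective, and all names (`rollout`, `bagger_style_update`, `TRUE_COEF`) are hypothetical.

```python
import numpy as np

# Toy 1-D "video": each frame is a scalar; the true dynamics are x_{t+1} = 0.9 * x_t.
# A linear next-frame model f(x) = w * x stands in for the diffusion transformer.
TRUE_COEF = 0.9

def rollout(w, x0, horizon):
    """Autoregressive rollout: each step conditions on the model's own last output."""
    frames = [x0]
    for _ in range(horizon):
        frames.append(w * frames[-1])
    return np.array(frames)

def bagger_style_update(w, x0, horizon, lr=0.1, steps=200):
    """Self-supervised correction in the BAgger spirit: build training pairs from the
    model's OWN rollout (self-generated context -> corrective next-frame target), so
    the model sees the drifted states it will actually encounter at inference time.
    The context is treated as fixed data (no gradient flows through the rollout),
    mirroring the paper's avoidance of long-chain backpropagation through time."""
    for _ in range(steps):
        gen = rollout(w, x0, horizon)   # self-generated context frames (stop-gradient)
        targets = TRUE_COEF * gen[:-1]  # corrective targets from the true dynamics
        preds = w * gen[:-1]            # model predictions on its own drifted states
        grad = np.mean(2 * (preds - targets) * gen[:-1])  # d/dw of squared loss
        w -= lr * grad
    return w

w0 = 0.5  # mis-specified initial model: rollouts decay far too fast
w_trained = bagger_style_update(w0, x0=1.0, horizon=10)

# Long-horizon drift = gap from the true trajectory after 20 autoregressive steps.
drift_before = abs(rollout(w0, 1.0, 20)[-1] - rollout(TRUE_COEF, 1.0, 20)[-1])
drift_after = abs(rollout(w_trained, 1.0, 20)[-1] - rollout(TRUE_COEF, 1.0, 20)[-1])
print(drift_before, drift_after)
```

In this toy setting, training on self-generated contexts pulls `w` toward the true coefficient, so long-horizon drift shrinks; the real method replaces the scalar regression with denoising objectives over frames, but the pairing of self-rollout contexts with corrective targets is the same shape.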