🤖 AI Summary
This work investigates lifelong learning for autoregressive video diffusion models under a single-video-stream training paradigm, aiming to match offline training performance during continual streaming. Methodologically, it introduces a lightweight subset-based experience replay mechanism that selectively caches critical historical frames to mitigate catastrophic forgetting, and establishes three million-scale synthetic benchmarks (Bouncing Balls, 3D Maze, and PLAICraft) to enable controlled, reproducible lifelong learning evaluation. Key contributions are threefold: (1) the first empirical demonstration that single-video-stream lifelong training achieves performance on par with offline training; (2) an efficient replay strategy that substantially reduces memory footprint and retraining overhead; and (3) the open-sourcing of the first benchmark suite dedicated to lifelong learning for video generation. Experiments show no statistically significant performance gap versus offline training on standard metrics, including FVD and LPIPS, under identical gradient step budgets.
📝 Abstract
This work demonstrates that training autoregressive video diffusion models from a single, continuous video stream is not only possible but, remarkably, can be competitive with standard offline training approaches given the same number of gradient steps. We further show that this main result can be achieved using experience replay that retains only a subset of the preceding video stream. We also contribute three new single-video generative modeling datasets suitable for evaluating lifelong video model learning: Lifelong Bouncing Balls, Lifelong 3D Maze, and Lifelong PLAICraft. Each dataset contains over a million consecutive frames from a synthetic environment, with the three environments increasing in complexity.
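To make the replay idea concrete, here is a minimal sketch of a fixed-capacity buffer that retains a subset of a frame stream. The selection strategy shown (uniform reservoir sampling) and all names (`SubsetReplayBuffer`, `add`, `sample`) are illustrative assumptions, not the paper's actual method, whose selection criterion for "critical" frames is not specified in this summary.

```python
import random

class SubsetReplayBuffer:
    """Illustrative fixed-capacity replay buffer for a video frame stream.

    Keeps a uniform random subset of all frames seen so far via reservoir
    sampling (an assumed strategy; the paper's criterion may differ).
    """

    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.buffer = []              # cached historical frames
        self.seen = 0                 # total frames observed so far
        self.rng = random.Random(seed)

    def add(self, frame):
        """Observe one new frame from the stream."""
        self.seen += 1
        if len(self.buffer) < self.capacity:
            self.buffer.append(frame)
        else:
            # After n frames, each has probability capacity / n of being kept.
            j = self.rng.randrange(self.seen)
            if j < self.capacity:
                self.buffer[j] = frame

    def sample(self, k):
        """Draw a mini-batch of cached frames to mix with fresh stream frames."""
        return self.rng.sample(self.buffer, min(k, len(self.buffer)))
```

In a lifelong training loop, each gradient step could combine the newest stream frames with a sample from this buffer, letting the model revisit earlier parts of the stream without storing it in full.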