Rolling Sink: Bridging Limited-Horizon Training and Open-Ended Testing in Autoregressive Video Diffusion

📅 2026-02-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the severe degradation in visual quality observed when autoregressive video diffusion models—trained on short clips—are applied to open-ended, long-duration video generation. To bridge this temporal gap without any additional training, the authors propose Rolling Sink, a test-time caching mechanism that builds on the Self Forcing framework with a systematic autoregressive cache-maintenance strategy. By dynamically rolling and updating the cache, Rolling Sink preserves temporal coherence and latent-state stability throughout extended generation. This approach achieves state-of-the-art performance on videos spanning 5 to 30 minutes, significantly outperforming existing methods in maintaining subject consistency, color stability, structural coherence, and motion smoothness—all without requiring model retraining or architectural modifications.

📝 Abstract
Recently, autoregressive (AR) video diffusion models have achieved remarkable performance. However, due to their limited training durations, a train-test gap emerges when testing at longer horizons, leading to rapid visual degradation. Following Self Forcing, which studies the train-test gap within the training duration, this work studies the train-test gap beyond the training duration, i.e., the gap between the limited horizons during training and open-ended horizons during testing. Since open-ended testing can extend beyond any finite training window, and long-video training is computationally expensive, we pursue a training-free solution to bridge this gap. To this end, we conduct a systematic analysis of AR cache maintenance; the resulting insights lead to Rolling Sink. Built on Self Forcing (trained on only 5s clips), Rolling Sink effectively scales AR video synthesis to ultra-long durations (e.g., 5-30 minutes at 16 FPS) at test time, with consistent subjects, stable colors, coherent structures, and smooth motions. As demonstrated by extensive experiments, Rolling Sink achieves superior long-horizon visual fidelity and temporal consistency compared to SOTA baselines. Project page: https://rolling-sink.github.io/
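The abstract describes Rolling Sink only at a high level: a test-time cache that is dynamically rolled and updated so generation can run past the training horizon. The paper's actual algorithm is not given here, but the general idea of pinning a few early ("sink") cache entries while rolling a window over the most recent ones can be sketched as follows. All names and parameters (`RollingSinkCache`, `num_sink`, `window`) are illustrative assumptions, not the paper's implementation.

```python
from collections import deque

class RollingSinkCache:
    """Toy sketch of a rolling cache with pinned 'sink' entries.

    The first `num_sink` cached items are kept permanently (the sink),
    while later items form a rolling window of the most recent entries.
    This is an illustration of the general sink-plus-window idea, not
    the cache-maintenance strategy from the Rolling Sink paper.
    """

    def __init__(self, num_sink: int, window: int):
        self.num_sink = num_sink
        self.sink = []                      # pinned early entries
        self.recent = deque(maxlen=window)  # rolling window of latest entries

    def append(self, item):
        # Fill the sink first; afterwards, new items roll through the window.
        if len(self.sink) < self.num_sink:
            self.sink.append(item)
        else:
            self.recent.append(item)  # deque drops the oldest automatically

    def contents(self):
        # What the model would attend over: sink + recent window.
        return self.sink + list(self.recent)

cache = RollingSinkCache(num_sink=2, window=3)
for t in range(8):
    cache.append(t)
print(cache.contents())  # → [0, 1, 5, 6, 7]
```

The cache size stays bounded (`num_sink + window` entries) no matter how long generation runs, which is what makes open-ended, training-free extension feasible in principle.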
Problem

Research questions and friction points this paper is trying to address.

autoregressive video diffusion
train-test gap
limited-horizon training
open-ended testing
long-horizon video generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Rolling Sink
autoregressive video diffusion
train-test gap
long-horizon video generation
training-free adaptation