🤖 AI Summary
Autoregressive video diffusion models (AR-VDMs) leave room for improvement in sample fidelity, while existing optimization- or search-based inference-time alignment methods are computationally impractical for them. Method: We propose AutoRefiner, a noise refiner tailored for AR-VDMs. At inference time it refines sampled noise along stochastic denoising trajectories in a feedforward manner, requiring no model fine-tuning or parameter updates. Its two key designs are pathwise noise refinement and a reflective KV cache designed to preserve autoregressive dependencies; together they address the failure of directly adapting text-to-image noise refiners. Contribution/Results: AutoRefiner serves as an efficient plug-in that improves the sample fidelity of AR-VDMs, in keeping with the real-time and interactive applications these models target.
📝 Abstract
Autoregressive video diffusion models (AR-VDMs) show strong promise as scalable alternatives to bidirectional VDMs, enabling real-time and interactive applications. Yet there remains room for improvement in their sample fidelity. A promising solution is inference-time alignment, which optimizes the noise space to improve sample fidelity without updating model parameters. However, optimization- or search-based methods are computationally impractical for AR-VDMs. Recent text-to-image (T2I) works address this via feedforward noise refiners that modulate sampled noise in a single forward pass. Can such noise refiners be extended to AR-VDMs? We identify the failure of naively extending T2I noise refiners to AR-VDMs and propose AutoRefiner, a noise refiner tailored for AR-VDMs, with two key designs: pathwise noise refinement and a reflective KV-cache. Experiments demonstrate that AutoRefiner serves as an efficient plug-in for AR-VDMs, effectively enhancing sample fidelity by refining noise along stochastic denoising paths.
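
To make the idea of a feedforward noise refiner concrete, the following is a minimal toy sketch in plain Python. It is not AutoRefiner's architecture, which the abstract does not specify. The function names (`sample_noise`, `refine_noise`, `refine_along_path`), the elementwise `tanh` layer, the `strength` parameter, and the renormalization step are all hypothetical choices made for illustration. The sketch assumes that "pathwise" refinement means refining the fresh noise drawn at each stochastic denoising step, and it does not model the reflective KV-cache.

```python
import math
import random


def sample_noise(n, rng):
    """Draw n i.i.d. standard Gaussian values (the sampled noise)."""
    return [rng.gauss(0.0, 1.0) for _ in range(n)]


def refine_noise(noise, weights, strength=0.1):
    """Toy feedforward refiner: a single pass, no iterative optimization.

    Adds a small residual from a fixed elementwise tanh layer (a stand-in
    for a learned network, purely illustrative), then rescales the result
    to zero mean and unit variance so it stays close to the Gaussian prior.
    """
    residual = [math.tanh(w * x) for w, x in zip(weights, noise)]
    mixed = [x + strength * r for x, r in zip(noise, residual)]
    mean = sum(mixed) / len(mixed)
    var = sum((m - mean) ** 2 for m in mixed) / len(mixed)
    std = math.sqrt(var) or 1.0  # guard against a degenerate zero variance
    return [(m - mean) / std for m in mixed]


def refine_along_path(num_steps, dim, weights, seed=0):
    """Refine the noise injected at every step of a stochastic path.

    Assumption: each denoising step draws fresh noise, and the refiner is
    applied once per step rather than only to the initial noise.
    """
    rng = random.Random(seed)
    return [
        refine_noise(sample_noise(dim, rng), weights)
        for _ in range(num_steps)
    ]


if __name__ == "__main__":
    path = refine_along_path(num_steps=4, dim=8, weights=[0.5] * 8)
    print(len(path), len(path[0]))
```

The point of the sketch is the cost profile: each refinement is one cheap forward computation per denoising step, in contrast to optimization- or search-based alignment, which would evaluate the generator many times per sample.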