Anchor Forcing: Anchor Memory and Tri-Region RoPE for Interactive Streaming Video Diffusion

Date: 2026-03-12
AI Summary
This work addresses two problems in interactive long-form video generation: abrupt prompt switches often disrupt semantic coherence, and unbounded temporal indices shift the positional-encoding distribution, which degrades visual quality and weakens motion dynamics. To mitigate these issues, the authors propose the Anchor Forcing framework, which stabilizes post-switch context reconstruction through an anchor-guided key-value (KV) caching mechanism. They also introduce a tri-region rotary position embedding (Tri-Region RoPE), combined with RoPE re-alignment distillation, to align the model's positional indexing with the bounded regime seen during pretraining. According to the authors, the approach improves perceptual quality and motion metrics for streaming video diffusion models on long videos, relative to prior streaming baselines in interactive settings.
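
To make the positional-shift problem concrete, the sketch below contrasts absolute stream indices, which grow without bound, with indices that are counted from a per-region origin and therefore stay inside a fixed range. It is only an illustration under stated assumptions: the three regions (anchor, history, current), the function names, and the contiguous layout are hypothetical and are not taken from the paper, which does not spell out its index assignment here.

```python
def absolute_positions(stream_step, n_frames):
    """Naive indexing: positions follow absolute stream time and grow unboundedly."""
    return [stream_step + i for i in range(n_frames)]


def tri_region_positions(n_anchor, n_history, n_current, max_pos):
    """Assign bounded temporal indices to three regions of the attention window.

    Hypothetical layout (an assumption, not the paper's definition):
    [anchor | history | current], each counted from its own origin, so the
    indices depend only on region sizes and never on absolute stream time.
    """
    total = n_anchor + n_history + n_current
    if total > max_pos:
        raise ValueError("window does not fit in the pretrained position range")
    anchor_origin = 0
    history_origin = n_anchor
    current_origin = n_anchor + n_history
    anchor = [anchor_origin + i for i in range(n_anchor)]
    history = [history_origin + i for i in range(n_history)]
    current = [current_origin + i for i in range(n_current)]
    return anchor, history, current
```

With absolute indexing, a chunk generated at stream step 1000 receives positions starting at 1000, which can fall outside whatever range the backbone saw during pretraining. With per-region origins, the same chunk receives the same small indices regardless of how long the stream has been running.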

Abstract
Interactive long video generation requires prompt switching to introduce new subjects or events, while maintaining perceptual fidelity and coherent motion over extended horizons. Recent distilled streaming video diffusion models reuse a rolling KV cache for long-range generation, enabling prompt-switch interaction through re-cache at each switch. However, existing streaming methods still exhibit progressive quality degradation and weakened motion dynamics. We identify two failure modes specific to interactive streaming generation: (i) at each prompt switch, current cache maintenance cannot simultaneously retain KV-based semantic context and recent latent cues, resulting in weak boundary conditioning and reduced perceptual quality; and (ii) during distillation, unbounded time indexing induces a positional distribution shift from the pretrained backbone's bounded RoPE regime, weakening pretrained motion priors and long-horizon motion retention. To address these issues, we propose Anchor Forcing, a cache-centric framework with two designs. First, an anchor-guided re-cache mechanism stores KV states in anchor caches and warm-starts re-cache from these anchors at each prompt switch, reducing post-switch evidence loss and stabilizing perceptual quality. Second, a tri-region RoPE with region-specific reference origins, together with RoPE re-alignment distillation, reconciles unbounded streaming indices with the pretrained RoPE regime to better retain motion priors. Experiments on long videos show that our method improves perceptual quality and motion metrics over prior streaming baselines in interactive settings. Project page: https://github.com/vivoCameraResearch/Anchor-Forcing
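
The anchor-guided re-cache idea can be sketched as a small data structure. This is a toy illustration, not the authors' implementation: the class name `AnchorKVCache`, the `is_anchor` flag, the `recompute` callback, and the use of strings as stand-ins for KV tensors are all assumptions made here. It shows a rolling cache that is rebuilt at a prompt switch while a separate anchor cache is kept untouched and placed ahead of it in the attention context.

```python
from collections import deque


class AnchorKVCache:
    """Toy rolling KV cache with a separate anchor cache (hypothetical design)."""

    def __init__(self, window, max_anchors):
        self.rolling = deque(maxlen=window)       # most recent KV entries
        self.anchors = deque(maxlen=max_anchors)  # retained anchor KV entries

    def append(self, kv, is_anchor=False):
        """Add a new KV entry; optionally also keep it as an anchor."""
        self.rolling.append(kv)
        if is_anchor:
            self.anchors.append(kv)

    def recache(self, recompute):
        """At a prompt switch, rebuild the rolling entries with `recompute`
        (a stand-in for re-encoding under the new prompt). Anchors are not
        modified, so earlier context survives the switch."""
        recent = list(self.rolling)
        self.rolling.clear()
        self.rolling.extend(recompute(kv) for kv in recent)

    def context(self):
        """Entries visible to attention: anchors first, then the rolling window."""
        return list(self.anchors) + list(self.rolling)


cache = AnchorKVCache(window=3, max_anchors=2)
for i in range(5):
    cache.append(f"kv{i}", is_anchor=(i % 2 == 0))
cache.recache(lambda kv: kv + "@new")
result = cache.context()
```

In this toy run, `result` contains the two most recent anchors in their original form followed by the three rolling entries rebuilt under the new prompt. How the actual method selects anchors and warm-starts the re-cache from them is described in the paper and project page, not here.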
Problem

Research questions and friction points this paper is trying to address.

interactive video generation
streaming video diffusion
prompt switching
perceptual fidelity
motion coherence
Innovation

Methods, ideas, or system contributions that make the work stand out.

Anchor Forcing
Interactive Streaming Video Diffusion
Anchor-Guided Re-Cache
Tri-Region RoPE
Motion Prior Retention