Anchor Forcing: Anchor Memory and Tri-Region RoPE for Interactive Streaming Video Diffusion

📅 2026-03-12

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses two challenges in interactive long-form video generation: abrupt prompt switching often disrupts semantic coherence, and unbounded temporal indices shift the positional-encoding distribution away from the one seen during pretraining, which degrades visual quality and weakens motion dynamics. To mitigate these issues, the authors propose the Anchor Forcing framework, which stabilizes post-switch context reconstruction through an anchor-guided key-value (KV) re-cache mechanism. They also introduce a tri-region rotary position embedding (Tri-Region RoPE), combined with RoPE re-alignment distillation, to reconcile streaming indices with the bounded RoPE regime of the pretrained backbone. In experiments on long videos, the approach improves perceptual quality and motion metrics over prior streaming baselines in interactive settings.
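The positional issue described above can be illustrated with a small sketch. This is not the authors' implementation: the function `tri_region_positions`, the choice of three regions (anchor, history, current chunk), the assumption that each region is a contiguous run of frames, and the `max_pos` limit are all hypothetical and chosen only for illustration. The `rope_angles` helper follows the standard RoPE angle formula. The idea shown is that each region measures time from its own reference origin, so the indices fed to RoPE stay inside a bounded range even when the absolute streaming index keeps growing.

```python
import math
from typing import List, Sequence


def rope_angles(pos: int, dim: int, base: float = 10000.0) -> List[float]:
    """Standard RoPE rotation angles for one position: pos * base^(-2i/dim)."""
    return [pos * base ** (-2 * i / dim) for i in range(dim // 2)]


def tri_region_positions(
    anchor_idx: Sequence[int],
    history_idx: Sequence[int],
    current_idx: Sequence[int],
    max_pos: int,
) -> List[List[int]]:
    """Remap absolute frame indices to bounded indices, one origin per region.

    Hypothetical layout (an assumption, not taken from the paper): the three
    regions are placed back to back, and every frame is indexed relative to
    the first frame of its own region.
    """
    remapped: List[List[int]] = []
    offset = 0  # first free slot in the bounded index range
    for region in (anchor_idx, history_idx, current_idx):
        if not region:
            remapped.append([])
            continue
        origin = region[0]  # region-specific reference origin
        positions = [offset + (t - origin) for t in region]
        remapped.append(positions)
        offset = positions[-1] + 1
    if offset > max_pos:
        raise ValueError("remapped indices exceed the assumed pretrained range")
    return remapped


# Absolute indices near frame 1000 are mapped into a small bounded range.
bounded = tri_region_positions(
    anchor_idx=[0, 1],
    history_idx=[995, 996, 997, 998, 999],
    current_idx=[1000, 1001, 1002],
    max_pos=21,
)
print(bounded)  # [[0, 1], [2, 3, 4, 5, 6], [7, 8, 9]]
print(rope_angles(bounded[2][0], dim=4))
```

In this toy layout the largest index passed to `rope_angles` is 9 rather than 1002, which is the kind of bounded regime the summary refers to. How the paper actually defines its regions and origins is not specified here.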

📝 Abstract

Interactive long video generation requires prompt switching to introduce new subjects or events, while maintaining perceptual fidelity and coherent motion over extended horizons. Recent distilled streaming video diffusion models reuse a rolling KV cache for long-range generation, enabling prompt-switch interaction through re-cache at each switch. However, existing streaming methods still exhibit progressive quality degradation and weakened motion dynamics. We identify two failure modes specific to interactive streaming generation: (i) at each prompt switch, current cache maintenance cannot simultaneously retain KV-based semantic context and recent latent cues, resulting in weak boundary conditioning and reduced perceptual quality; and (ii) during distillation, unbounded time indexing induces a positional distribution shift from the pretrained backbone's bounded RoPE regime, weakening pretrained motion priors and long-horizon motion retention. To address these issues, we propose Anchor Forcing, a cache-centric framework with two designs. First, an anchor-guided re-cache mechanism stores KV states in anchor caches and warm-starts re-cache from these anchors at each prompt switch, reducing post-switch evidence loss and stabilizing perceptual quality. Second, a tri-region RoPE with region-specific reference origins, together with RoPE re-alignment distillation, reconciles unbounded streaming indices with the pretrained RoPE regime to better retain motion priors. Experiments on long videos show that our method improves perceptual quality and motion metrics over prior streaming baselines in interactive settings. Project page: https://github.com/vivoCameraResearch/Anchor-Forcing
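The anchor-guided re-cache idea in the abstract can be sketched as a toy data structure. This is a minimal illustration under stated assumptions, not the released code: the class `AnchorRollingCache`, the rule of keeping every n-th frame as an anchor, the `max_anchors` cap, and the use of `(frame_index, prompt)` tuples in place of real key/value tensors are all hypothetical. It shows a rolling window that forgets old frames, a separate anchor store that keeps some earlier KV states, and a re-cache step at a prompt switch that combines retained anchors with recent frames recomputed under the new prompt.

```python
from collections import deque
from typing import Deque, List, Tuple

# Stand-in for a real key/value tensor pair: (frame index, prompt it was encoded under).
KV = Tuple[int, str]


class AnchorRollingCache:
    """Toy rolling KV cache with a separate anchor store (hypothetical design)."""

    def __init__(self, window: int, anchor_every: int, max_anchors: int) -> None:
        self.rolling: Deque[KV] = deque(maxlen=window)  # recent frames only
        self.anchors: List[KV] = []                     # longer-lived KV states
        self.anchor_every = anchor_every                # assumed anchor selection rule
        self.max_anchors = max_anchors                  # assumed cap on anchor count

    def append(self, frame_idx: int, prompt: str) -> None:
        entry = (frame_idx, prompt)
        self.rolling.append(entry)  # the oldest entry is evicted automatically
        if frame_idx % self.anchor_every == 0:
            self.anchors.append(entry)
            self.anchors = self.anchors[-self.max_anchors:]

    def recache(self, new_prompt: str) -> List[KV]:
        """Rebuild the context at a prompt switch, warm-started from anchors."""
        # Recent frames are re-encoded under the new prompt.
        recomputed = [(idx, new_prompt) for idx, _ in self.rolling]
        recent_ids = {idx for idx, _ in recomputed}
        # Anchors outside the recent window keep their stored KV states.
        kept = [a for a in self.anchors if a[0] not in recent_ids]
        self.rolling = deque(recomputed, maxlen=self.rolling.maxlen)
        return kept + recomputed


cache = AnchorRollingCache(window=3, anchor_every=4, max_anchors=2)
for t in range(10):
    cache.append(t, "A")

context = cache.recache("B")
print(context)  # [(4, 'A'), (7, 'B'), (8, 'B'), (9, 'B')]
```

In this toy run, frame 4 survives as an anchor carrying context from before the switch, while frames 7 to 9 are refreshed under the new prompt. A plain rolling cache would have kept only the last three frames.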

Problem

Research questions and friction points this paper is trying to address.

interactive video generation

streaming video diffusion

prompt switching

perceptual fidelity

motion coherence

Innovation

Methods, ideas, or system contributions that make the work stand out.

Anchor Forcing

Interactive Streaming Video Diffusion

Anchor-Guided Re-Cache