🤖 AI Summary
This work addresses three key challenges in autoregressive video generation: semantic forgetting from limited context, visual drift caused by positional extrapolation, and degraded controllability during interactive prompt switching. It introduces Grounded Forcing, a collaborative modeling framework that jointly captures time-invariant semantics and local dynamics through three core mechanisms: a dual-memory key-value (KV) caching scheme that disentangles global semantic anchors from local dynamic representations, a dual-reference Rotary Position Embedding (RoPE) injection strategy that suppresses visual drift by keeping positional embeddings within the range seen during training, and an asymmetric proximity re-caching scheme that enables smooth transitions when prompts change. Experimental results demonstrate that the method substantially enhances semantic consistency and visual stability in long-sequence video generation, offering robust support for interactive long-form video synthesis.
📝 Abstract
Autoregressive video synthesis offers a promising pathway to infinite-horizon generation but is fundamentally hindered by three intertwined challenges: semantic forgetting from context limitations, visual drift due to positional extrapolation, and controllability loss during interactive instruction switching. Current methods often tackle these issues in isolation, limiting long-term coherence. We introduce Grounded Forcing, a novel framework that bridges time-invariant semantics and proximal dynamics through three interlocking mechanisms. First, to address semantic forgetting, we propose a Dual Memory KV Cache that decouples local temporal dynamics from global semantic anchors, ensuring long-term semantic coherence and identity stability. Second, to suppress visual drift, we design Dual-Reference RoPE Injection, which confines positional embeddings within the training manifold while rendering global semantics time-invariant. Third, to resolve controllability issues, we develop Asymmetric Proximity Recache, which facilitates smooth semantic inheritance during prompt transitions via proximity-weighted cache updates. These components operate synergistically to tether the generative process to stable semantic cores while accommodating flexible local dynamics. Extensive experiments demonstrate that Grounded Forcing significantly enhances long-range consistency and visual stability, establishing a robust foundation for interactive long-form video synthesis.
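To make the dual-memory idea concrete, here is a minimal, heavily simplified sketch (all class and method names are illustrative assumptions, not the paper's implementation): a frozen bank of semantic anchors is kept alongside a bounded sliding window of recent dynamics, and positional indices are re-based over the current context rather than the absolute frame count, in the spirit of keeping positions inside the training range.

```python
from collections import deque

class DualMemoryKVCache:
    """Toy sketch of a dual-memory KV cache: a pinned bank of semantic
    anchors plus a bounded sliding window of recent dynamics.
    (Hypothetical names; not the paper's API.)"""

    def __init__(self, semantic_size: int, window_size: int):
        self.semantic_size = semantic_size         # anchors kept permanently
        self.semantic: list = []                   # global semantic anchors
        self.dynamics = deque(maxlen=window_size)  # recent local dynamics

    def append(self, kv_entry) -> None:
        # Early entries (e.g. from the prompt / first frames) are pinned as
        # time-invariant anchors; later entries cycle through the bounded
        # window, so old dynamics are evicted but semantics persist.
        if len(self.semantic) < self.semantic_size:
            self.semantic.append(kv_entry)
        else:
            self.dynamics.append(kv_entry)

    def context(self) -> list:
        # Attention sees anchors + recent window, so context length stays
        # bounded no matter how many frames have been generated.
        return self.semantic + list(self.dynamics)

    def positions(self) -> list:
        # Positions are re-based over the current context instead of the
        # absolute frame index, so they never extrapolate beyond the range
        # seen during training.
        return list(range(len(self.context())))

cache = DualMemoryKVCache(semantic_size=2, window_size=3)
for t in range(100):            # simulate 100 generated steps
    cache.append(f"kv_{t}")
```

After 100 steps the cache still exposes only the two anchors plus the last three dynamic entries, with positional indices 0 through 4, which is the bounded-context behavior the abstract attributes to the Dual Memory KV Cache.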