Knot Forcing: Taming Autoregressive Video Diffusion Models for Real-time Infinite Interactive Portrait Animation

๐Ÿ“… 2025-12-25
๐Ÿ“ˆ Citations: 0
โœจ Influential: 0
๐Ÿ“„ PDF
๐Ÿค– AI Summary
Real-time portrait animation for virtual assistants and interactive avatars must deliver high fidelity, low latency, unbounded sequence length, and long-term temporal consistency. To meet these requirements, this paper proposes Knot Forcing, a causal autoregressive video generation framework. The method integrates: (1) a reference-image identity-anchoring mechanism that reuses cached KV states; (2) a "running ahead" mechanism that combines sliding-window attention with dynamic temporal coordinate offsets so the reference frame's semantic context stays ahead of the current rollout; and (3) a temporal knot module that overlaps adjacent chunks and propagates spatio-temporal cues to mitigate error accumulation and motion discontinuities. Running on consumer-grade GPUs, the framework achieves real-time inference above 30 FPS, supports arbitrarily long sequences, and maintains sub-millisecond response latency. It significantly improves long-term temporal coherence and inter-chunk motion smoothness, setting a strong reference point for visual stability in interactive applications.
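
For illustration only, the following is a minimal sketch of the chunk-wise rollout described above, not the authors' implementation. It assumes single-head attention over hypothetical (tokens, dim) feature tensors: the reference image's KV states are computed once and reused by every chunk, while temporal context is restricted to a sliding window over recently generated frames.

```python
import torch

def attention(q, k, v):
    """Plain scaled dot-product attention over (tokens, dim) tensors."""
    scores = (q @ k.T) / (q.shape[-1] ** 0.5)
    return torch.softmax(scores, dim=-1) @ v

def stream_chunks(ref_tokens, driving_chunks, window=8):
    """Yield one chunk of frame features per driving chunk (hypothetical names)."""
    ref_kv = ref_tokens                          # cached once: global identity anchor
    history = []                                 # recently generated frame features
    for drive in driving_chunks:                 # drive: (chunk_frames, dim)
        local = history[-window:]                # sliding-window temporal context
        kv = torch.cat([ref_kv, *local, drive])  # reference KV + local window + current chunk
        out = attention(drive, kv, kv)           # causal: no future frames are visible
        history.extend(out.split(1))             # append generated frames to the history
        yield out

# Toy usage: 16-dim frame features, a 4-token reference, three 4-frame chunks.
ref = torch.randn(4, 16)
chunks = [torch.randn(4, 16) for _ in range(3)]
frames = list(stream_chunks(ref, chunks))
```

In the real model each chunk would be produced by a few-step causal diffusion denoiser; the sketch only shows how a fixed reference KV cache and a bounded history window keep per-frame cost constant over an unbounded stream.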

๐Ÿ“ Abstract
Real-time portrait animation is essential for interactive applications such as virtual assistants and live avatars, requiring high visual fidelity, temporal coherence, ultra-low latency, and responsive control from dynamic inputs like reference images and driving signals. While diffusion-based models achieve strong quality, their non-causal nature hinders streaming deployment. Causal autoregressive video generation approaches enable efficient frame-by-frame generation but suffer from error accumulation, motion discontinuities at chunk boundaries, and degraded long-term consistency. In this work, we present a novel streaming framework named Knot Forcing for real-time portrait animation that addresses these challenges through three key designs: (1) a chunk-wise generation strategy with global identity preservation via cached KV states of the reference image and local temporal modeling using sliding window attention; (2) a temporal knot module that overlaps adjacent chunks and propagates spatio-temporal cues via image-to-video conditioning to smooth inter-chunk motion transitions; and (3) a "running ahead" mechanism that dynamically updates the reference frame's temporal coordinate during inference, keeping its semantic context ahead of the current rollout frame to support long-term coherence. Knot Forcing enables high-fidelity, temporally consistent, and interactive portrait animation over infinite sequences, achieving real-time performance with strong visual stability on consumer-grade GPUs.
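
As a rough illustration of the temporal knot idea, the sketch below stitches two adjacent chunks by blending their shared overlap, so the tail of one chunk conditions the head of the next. It is a simplified stand-in under assumed tensor shapes, not the paper's exact image-to-video conditioning procedure, and all names are hypothetical.

```python
import torch

def knot_chunks(prev_chunk, next_chunk, overlap=2):
    """Blend `overlap` frames shared by two adjacent chunks of (T, C, H, W) frames."""
    # Linear cross-fade weights over the overlapping region.
    w = torch.linspace(0.0, 1.0, overlap).view(overlap, 1, 1, 1)
    tail = prev_chunk[-overlap:]              # spatio-temporal cue from chunk i
    head = next_chunk[:overlap]               # re-generated frames in chunk i+1
    blended = (1 - w) * tail + w * head       # smooth the inter-chunk transition
    return torch.cat([blended, next_chunk[overlap:]], dim=0)

# Toy usage: two chunks of 4 RGB frames at 8x8 resolution.
prev = torch.rand(4, 3, 8, 8)
nxt = torch.rand(4, 3, 8, 8)
stitched = knot_chunks(prev, nxt, overlap=2)  # frames appended to the output stream
```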
Problem

Research questions and friction points this paper is trying to address.

Enables real-time infinite interactive portrait animation
Addresses error accumulation and motion discontinuities in autoregressive models
Ensures long-term coherence and smooth inter-chunk transitions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Chunk-wise generation with cached KV states
Temporal knot module for smooth motion transitions
Dynamic reference frame update for long-term coherence (see the sketch after this list)
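
A minimal sketch of the "running ahead" idea, under assumed details: the cached reference tokens are re-positioned at a temporal coordinate that stays a fixed lead ahead of the frame currently being generated, so the reference's semantic context never falls behind the rollout. The rotary-embedding helper and the lead size are illustrative assumptions, not taken from the paper.

```python
import torch

def rotary_embed(x, position, base=10000.0):
    """Apply a 1-D rotary position embedding at a given temporal position."""
    half = x.shape[-1] // 2
    freqs = base ** (-torch.arange(half, dtype=torch.float32) / half)
    angles = position * freqs
    cos, sin = torch.cos(angles), torch.sin(angles)
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

def reference_tokens_at(ref_tokens, current_frame_idx, lead=16):
    """Re-encode the cached reference at a temporal coordinate ahead of the rollout."""
    ahead_position = current_frame_idx + lead     # reference always "runs ahead"
    return rotary_embed(ref_tokens, ahead_position)

# Toy usage: a 4-token, 32-dim reference re-positioned while generating frame 120.
ref = torch.randn(4, 32)
ref_now = reference_tokens_at(ref, current_frame_idx=120)
```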
๐Ÿ”Ž Similar Papers
No similar papers found.
Steven Xiao
Tongyi Lab, Alibaba Group
Xindi Zhang
Tongyi Lab, Alibaba Group
Dechao Meng
PhD candidate, Institute of Computing Technology, Chinese Academy of Sciences
deep learning, computer vision
Qi Wang
Tongyi Lab, Alibaba Group
Peng Zhang
Tongyi Lab, Alibaba Group
Bang Zhang
Tongyi Lab, Alibaba Group