Knot Forcing: Taming Autoregressive Video Diffusion Models for Real-time Infinite Interactive Portrait Animation

๐Ÿ“… 2025-12-25
๐Ÿ“ˆ Citations: 0
โœจ Influential: 0
๐Ÿ“„ PDF
๐Ÿค– AI Summary
Real-time portrait animation for virtual assistants and interactive avatars must deliver high fidelity, low latency, unbounded sequence length, and long-term temporal consistency. To meet these requirements, this paper proposes Knot Forcing, a causal autoregressive video generation framework. The method integrates: (1) a reference-image identity-anchoring mechanism that reuses cached KV states; (2) a "running ahead" mechanism that combines sliding-window attention with dynamic temporal coordinate offsets so the reference frame's semantic context stays ahead of the current rollout; and (3) a temporal knot module that overlaps adjacent chunks and propagates spatio-temporal cues to mitigate error accumulation and motion discontinuities. Running on consumer-grade GPUs, the framework achieves real-time inference above 30 FPS, supports arbitrarily long sequences, and maintains sub-millisecond response latency. It significantly improves long-term temporal coherence and inter-chunk motion smoothness, setting a strong reference point for visual stability in interactive applications.
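
For illustration only, the following is a minimal sketch of the chunk-wise rollout described above, not the authors' implementation. It assumes single-head attention over hypothetical (tokens, dim) feature tensors: the reference image's KV states are computed once and reused by every chunk, while temporal context is restricted to a sliding window over recently generated frames.

```python
import torch

def attention(q, k, v):
    """Plain scaled dot-product attention over (tokens, dim) tensors."""
    scores = (q @ k.T) / (q.shape[-1] ** 0.5)
    return torch.softmax(scores, dim=-1) @ v

def stream_chunks(ref_tokens, driving_chunks, window=8):
    """Yield one chunk of frame features per driving chunk (hypothetical names)."""
    ref_kv = ref_tokens                          # cached once: global identity anchor
    history = []                                 # recently generated frame features
    for drive in driving_chunks:                 # drive: (chunk_frames, dim)
        local = history[-window:]                # sliding-window temporal context
        kv = torch.cat([ref_kv, *local, drive])  # reference KV + local window + current chunk
        out = attention(drive, kv, kv)           # causal: no future frames are visible
        history.extend(out.split(1))             # append generated frames to the history
        yield out

# Toy usage: 16-dim frame features, a 4-token reference, three 4-frame chunks.
ref = torch.randn(4, 16)
chunks = [torch.randn(4, 16) for _ in range(3)]
frames = list(stream_chunks(ref, chunks))
```

In the real model each chunk would be produced by a few-step causal diffusion denoiser; the sketch only shows how a fixed reference KV cache and a bounded history window keep per-frame cost constant over an unbounded stream.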

๐Ÿ“ Abstract
Real-time portrait animation is essential for interactive applications such as virtual assistants and live avatars, requiring high visual fidelity, temporal coherence, ultra-low latency, and responsive control from dynamic inputs like reference images and driving signals. While diffusion-based models achieve strong quality, their non-causal nature hinders streaming deployment. Causal autoregressive video generation approaches enable efficient frame-by-frame generation but suffer from error accumulation, motion discontinuities at chunk boundaries, and degraded long-term consistency. In this work, we present a novel streaming framework named Knot Forcing for real-time portrait animation that addresses these challenges through three key designs: (1) a chunk-wise generation strategy with global identity preservation via cached KV states of the reference image and local temporal modeling using sliding window attention; (2) a temporal knot module that overlaps adjacent chunks and propagates spatio-temporal cues via image-to-video conditioning to smooth inter-chunk motion transitions; and (3) a "running ahead" mechanism that dynamically updates the reference frame's temporal coordinate during inference, keeping its semantic context ahead of the current rollout frame to support long-term coherence. Knot Forcing enables high-fidelity, temporally consistent, and interactive portrait animation over infinite sequences, achieving real-time performance with strong visual stability on consumer-grade GPUs.
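
As a rough illustration of the temporal knot idea, the sketch below stitches two adjacent chunks by blending their shared overlap, so the tail of one chunk conditions the head of the next. It is a simplified stand-in under assumed tensor shapes, not the paper's exact image-to-video conditioning procedure, and all names are hypothetical.

```python
import torch

def knot_chunks(prev_chunk, next_chunk, overlap=2):
    """Blend `overlap` frames shared by two adjacent chunks of (T, C, H, W) frames."""
    # Linear cross-fade weights over the overlapping region.
    w = torch.linspace(0.0, 1.0, overlap).view(overlap, 1, 1, 1)
    tail = prev_chunk[-overlap:]              # spatio-temporal cue from chunk i
    head = next_chunk[:overlap]               # re-generated frames in chunk i+1
    blended = (1 - w) * tail + w * head       # smooth the inter-chunk transition
    return torch.cat([blended, next_chunk[overlap:]], dim=0)

# Toy usage: two chunks of 4 RGB frames at 8x8 resolution.
prev = torch.rand(4, 3, 8, 8)
nxt = torch.rand(4, 3, 8, 8)
stitched = knot_chunks(prev, nxt, overlap=2)  # frames appended to the output stream
```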
Problem

Research questions and friction points this paper is trying to address.

Enables real-time infinite interactive portrait animation
Addresses error accumulation and motion discontinuities in autoregressive models
Ensures long-term coherence and smooth inter-chunk transitions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Chunk-wise generation with cached KV states
Temporal knot module for smooth motion transitions
Dynamic reference frame update for long-term coherence (see the sketch after this list)
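
A minimal sketch of the "running ahead" idea, under assumed details: the cached reference tokens are re-positioned at a temporal coordinate that stays a fixed lead ahead of the frame currently being generated, so the reference's semantic context never falls behind the rollout. The rotary-embedding helper and the lead size are illustrative assumptions, not taken from the paper.

```python
import torch

def rotary_embed(x, position, base=10000.0):
    """Apply a 1-D rotary position embedding at a given temporal position."""
    half = x.shape[-1] // 2
    freqs = base ** (-torch.arange(half, dtype=torch.float32) / half)
    angles = position * freqs
    cos, sin = torch.cos(angles), torch.sin(angles)
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

def reference_tokens_at(ref_tokens, current_frame_idx, lead=16):
    """Re-encode the cached reference at a temporal coordinate ahead of the rollout."""
    ahead_position = current_frame_idx + lead     # reference always "runs ahead"
    return rotary_embed(ref_tokens, ahead_position)

# Toy usage: a 4-token, 32-dim reference re-positioned while generating frame 120.
ref = torch.randn(4, 32)
ref_now = reference_tokens_at(ref, current_frame_idx=120)
```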
๐Ÿ”Ž Similar Papers
No similar papers found.
Steven Xiao
Tongyi Lab, Alibaba Group
Xindi Zhang
Tongyi Lab, Alibaba Group
Dechao Meng
PhD candidate, Institute of Computing Technology, Chinese Academy of Sciences
deep learning, computer vision
Qi Wang
Tongyi Lab, Alibaba Group
Peng Zhang
Tongyi Lab, Alibaba Group
Bang Zhang
Tongyi Lab, Alibaba Group