AvatarForcing: One-Step Streaming Talking Avatars via Local-Future Sliding-Window Denoising

📅 2026-03-15
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses a longstanding challenge in real-time speech-driven avatar generation: balancing low latency with long-term stability. Autoregressive approaches suffer from error accumulation, while full-sequence diffusion models incur prohibitive computational cost. The authors propose a one-step streaming diffusion framework that generates high-quality video frames chunk by chunk under fixed per-step cost by denoising a local-future sliding window. To preserve temporal consistency over long sequences, they introduce a dual-anchor temporal constraint mechanism that combines RoPE-based positional re-indexing, anchor-audio zero-padding, and reuse of previously generated chunks, trained with a two-stage streaming distillation strategy. The resulting 1.3B-parameter student model runs in real time at 34 ms per frame and achieves state-of-the-art lip-sync accuracy and visual quality on standard benchmarks and a newly introduced 400-video long-form benchmark.
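As a rough mental model of the chunk-by-chunk generation described above, the sketch below walks a fixed-size local-future window over the stream, assigns heterogeneous noise levels to its blocks, and emits one clean block per step while reusing recently emitted blocks as context. All names (`one_step_denoise`, `BLOCK`, `WINDOW`), shapes, and the noise schedule are illustrative assumptions, not the paper's actual implementation; the real denoiser is a distilled 1.3B diffusion transformer.

```python
# Minimal sketch of local-future sliding-window streaming denoising.
# All names and sizes here are illustrative placeholders.
import numpy as np

BLOCK = 4        # frames emitted per step (assumption)
WINDOW = 3       # future blocks kept in the denoising window (assumption)
FRAME_DIM = 8    # toy latent dimension

def one_step_denoise(window_latents, noise_levels, audio, context):
    """Placeholder for the distilled one-step student model.

    In the paper this is a diffusion transformer; here we return the
    latents unchanged so the loop is runnable end to end.
    """
    return window_latents

def stream_avatar(audio_blocks):
    """Emit one clean block per step under constant per-step cost."""
    rng = np.random.default_rng(0)
    # Heterogeneous noise levels: the nearest future block is almost clean,
    # farther blocks are noisier (illustrative schedule).
    noise_levels = np.linspace(0.1, 1.0, WINDOW)
    window = rng.standard_normal((WINDOW, BLOCK, FRAME_DIM))
    context = []  # recently emitted clean blocks (temporal-anchor reuse)
    for audio in audio_blocks:
        window = one_step_denoise(window, noise_levels, audio, context[-2:])
        clean_block = window[0]            # nearest block is emitted
        context.append(clean_block)
        # Slide the window: drop the emitted block, append a fresh noisy one.
        new_block = rng.standard_normal((1, BLOCK, FRAME_DIM))
        window = np.concatenate([window[1:], new_block], axis=0)
        yield clean_block

# Example: stream five audio blocks' worth of frames.
for i, block in enumerate(stream_avatar([None] * 5)):
    print(f"step {i}: emitted block of shape {block.shape}")
```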

📝 Abstract
Real-time talking avatar generation requires low latency and minute-level temporal stability. Autoregressive (AR) forcing enables streaming inference but suffers from exposure bias, which causes errors to accumulate and become irreversible over long rollouts. In contrast, full-sequence diffusion transformers mitigate drift but remain computationally prohibitive for real-time long-form synthesis. We present AvatarForcing, a one-step streaming diffusion framework that denoises a fixed local-future window with heterogeneous noise levels and emits one clean block per step under constant per-step cost. To stabilize unbounded streams, the method introduces dual-anchor temporal forcing: a style anchor that re-indexes RoPE to maintain a fixed relative position with respect to the active window and applies anchor-audio zero-padding, and a temporal anchor that reuses recently emitted clean blocks to ensure smooth transitions. Real-time one-step inference is enabled by two-stage streaming distillation with offline ODE backfill and distribution matching. Experiments on standard benchmarks and a new 400-video long-form benchmark show strong visual quality and lip synchronization at 34 ms/frame using a 1.3B-parameter student model for real-time streaming. Our project page is available at: https://cuiliyuan121.github.io/AvatarForcing/
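To illustrate the style-anchor idea from the abstract (re-indexing RoPE so the anchor keeps a fixed relative position to the active window as the stream advances), here is a minimal sketch. The helper names, the anchor offset, and the window/dimension sizes are assumptions for illustration only, not the authors' implementation.

```python
# Hedged sketch of RoPE re-indexing for a style anchor: the anchor's position
# is shifted every step so its offset from the window never grows.
import numpy as np

def rope_angles(positions, dim, base=10000.0):
    """Standard RoPE rotation angles for integer positions."""
    inv_freq = 1.0 / (base ** (np.arange(0, dim, 2) / dim))
    return np.outer(positions, inv_freq)  # shape (len(positions), dim // 2)

ANCHOR_OFFSET = -16   # fixed relative position of the style anchor (assumption)
WINDOW_LEN = 12       # tokens in the active denoising window (assumption)
DIM = 8               # toy head dimension

def window_and_anchor_angles(step):
    """RoPE angles for the active window and the re-indexed style anchor.

    The window occupies positions [step, step + WINDOW_LEN); the style anchor
    is re-indexed to step + ANCHOR_OFFSET at every step, so its rotation
    relative to the window does not drift as the stream advances.
    """
    window_pos = np.arange(step, step + WINDOW_LEN)
    anchor_pos = np.array([step + ANCHOR_OFFSET])
    return rope_angles(anchor_pos, DIM), rope_angles(window_pos, DIM)

# The relative angle between anchor and first window token is step-invariant:
a0, w0 = window_and_anchor_angles(step=0)
a9, w9 = window_and_anchor_angles(step=900)
print(np.allclose(w0[0] - a0[0], w9[0] - a9[0]))  # True
```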
Problem

Research questions and friction points this paper is trying to address.

talking avatar
real-time generation
temporal stability
exposure bias
streaming synthesis
Innovation

Methods, ideas, or system contributions that make the work stand out.

streaming talking avatars
one-step diffusion
local-future sliding window
dual-anchor temporal forcing
streaming distillation
👥 Authors
Liyuan Cui
Zhejiang University
Wentao Hu
PhD student, The Hong Kong Polytechnic University
Large Language Model · Computer Vision
Wenyuan Zhang
Tsinghua University
3D Computer Vision · 3D Reconstruction · Video Generation
Zesong Yang
Zhejiang University
Fan Shi
Kling Team, Kuaishou Technology
Xiaoqiang Liu
Kling Team, Kuaishou Technology