REST: Diffusion-based Real-time End-to-end Streaming Talking Head Generation via ID-Context Caching and Asynchronous Streaming Distillation

📅 2025-12-11
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Diffusion-based talking-head generation is constrained by slow inference and a non-autoregressive paradigm that precludes streaming. This paper introduces REST, the first diffusion-based real-time, end-to-end streaming speech-driven talking-head framework. The method rests on two core innovations: (1) an ID-Context Cache mechanism that enables identity-aware key-value caching and autoregressive streaming generation within a compact video latent space learned via high spatiotemporal VAE compression; and (2) Asynchronous Streaming Distillation (ASD), a training strategy in which a non-streaming teacher with an asynchronous noise schedule supervises the streaming student, mitigating temporal error accumulation. Evaluated under strict real-time constraints (<100 ms end-to-end latency), the approach improves visual quality, temporal coherence, and identity fidelity, outperforming existing state-of-the-art methods on quantitative and qualitative benchmarks.

📝 Abstract
Diffusion models have significantly advanced the field of talking head generation. However, slow inference speed and the non-autoregressive paradigm severely constrain the application of diffusion-based THG models. In this study, we propose REST, the first diffusion-based, real-time, end-to-end streaming audio-driven talking head generation framework. To support real-time end-to-end generation, a compact video latent space is first learned through high spatiotemporal VAE compression. Additionally, to enable autoregressive streaming within the compact video latent space, we introduce an ID-Context Cache mechanism, which integrates ID-Sink and Context-Cache principles into key-value caching to maintain temporal consistency and identity coherence during long-duration streaming generation. Furthermore, an Asynchronous Streaming Distillation (ASD) training strategy is proposed to mitigate error accumulation in autoregressive generation and enhance temporal consistency: a non-streaming teacher with an asynchronous noise schedule supervises the training of the streaming student model. REST bridges the gap between autoregressive and diffusion-based approaches, demonstrating substantial value for applications requiring real-time talking head generation. Experimental results demonstrate that REST outperforms state-of-the-art methods in both generation speed and overall performance.
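The ID-Context Cache described above can be pictured as a key-value cache with two pools: a pinned "ID sink" holding entries from the earliest frames (so the subject's identity stays visible to attention throughout a long stream) and a rolling context window for recent frames. A minimal sketch, with all class and variable names hypothetical and the actual cache layout in the paper likely more involved:

```python
from collections import deque

class IDContextCache:
    """Hypothetical sketch of an ID-Context Cache for streaming attention.

    Two pools of cached key-value entries:
      * an "ID sink": the first `id_frames` entries, never evicted, so the
        identity established at the start of the stream remains attendable
        (in the spirit of attention sinks in streaming LLMs);
      * a rolling context window of the most recent `context_frames`
        entries for short-term temporal consistency.
    """

    def __init__(self, id_frames: int, context_frames: int):
        self.id_frames = id_frames
        self.id_cache = []                            # pinned identity entries
        self.context = deque(maxlen=context_frames)   # rolling recent window

    def append(self, kv):
        if len(self.id_cache) < self.id_frames:
            self.id_cache.append(kv)   # pin early frames as the ID sink
        else:
            self.context.append(kv)    # oldest context evicted automatically

    def attend_over(self):
        # A new frame attends over pinned ID entries plus recent context.
        return self.id_cache + list(self.context)

# Usage: stream 10 frame latents through a cache with 2 pinned ID frames
# and a 3-frame rolling context window.
cache = IDContextCache(id_frames=2, context_frames=3)
for t in range(10):
    cache.append(f"kv_{t}")
print(cache.attend_over())  # ['kv_0', 'kv_1', 'kv_7', 'kv_8', 'kv_9']
```

The point of the fixed-size window is that attention cost per new frame stays constant regardless of stream length, which is what makes long-duration streaming tractable.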
Problem

Research questions and friction points this paper is trying to address.

Real-time streaming talking head generation
Autoregressive diffusion model efficiency
Temporal consistency in long-duration streaming
Innovation

Methods, ideas, or system contributions that make the work stand out.

Compact video latent space via VAE compression
ID-Context Cache for temporal consistency
Asynchronous Streaming Distillation to reduce errors
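The ASD idea can be sketched as follows: each frame in a chunk is assigned its own noise level (later frames noisier, mimicking the mid-stream state of an autoregressive student), and a non-streaming teacher denoising the whole chunk supervises the streaming student on the same noisy input. All function names are hypothetical, and the L2 objective below is a stand-in for whatever distillation loss the paper actually uses:

```python
import numpy as np

def asynchronous_noise_schedule(num_frames, t_min=0.02, t_max=0.98):
    """Hypothetical per-frame noise levels: earlier frames in a streaming
    chunk get lower noise (closer to clean), later frames higher noise."""
    return np.linspace(t_min, t_max, num_frames)

def distillation_loss(teacher_denoise, student_denoise, latents, rng):
    """Sketch of one ASD step: corrupt all frames with asynchronous noise
    levels, let teacher and student denoise the same noisy chunk, and
    regress the student toward the teacher (L2 proxy objective)."""
    t = asynchronous_noise_schedule(len(latents))
    noise = rng.standard_normal(latents.shape)
    # Per-frame interpolation between clean latents and Gaussian noise.
    noisy = np.sqrt(1 - t)[:, None] * latents + np.sqrt(t)[:, None] * noise
    target = teacher_denoise(noisy, t)   # teacher sees the whole chunk at once
    pred = student_denoise(noisy, t)     # student would run frame by frame
    return float(np.mean((pred - target) ** 2))

# Usage with toy latents and identity denoisers (teacher == student):
rng = np.random.default_rng(0)
latents = rng.standard_normal((8, 4))    # 8 frames, 4-dim toy latents
identity = lambda x, t: x
print(distillation_loss(identity, identity, latents, rng))  # 0.0
```

The asynchronous schedule is the key design choice: a teacher trained only on uniformly-noised chunks never sees the skewed noise distribution a streaming student produces, so staggering noise across frame positions aligns the teacher's supervision with the student's actual inference-time regime.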
Haotian Wang
University of Science and Technology of China, China
Yuzhe Weng
University of Science and Technology of China, China
Xinyi Yu
University of Science and Technology of China, China
Jun Du
University of Science and Technology of China, China
Haoran Xu
iFLYTEK, China
Xiaoyan Wu
iFLYTEK, China
Shan He
iFLYTEK, China
Bing Yin
Amazon.com
NLP · Information Retrieval · Deep Learning · Knowledge Graphs
Cong Liu
iFLYTEK, China
Qingfeng Liu
Professor, Hosei University
Econometrics