StreamingTalker: Audio-driven 3D Facial Animation with Autoregressive Diffusion Model

๐Ÿ“… 2025-11-18
๐Ÿ“ˆ Citations: 0
โœจ Influential: 0
๐Ÿค– AI Summary
Speech-driven 3D facial animation suffers from high latency and poor generalization when processing long audio inputs. To address this, we propose the first autoregressive diffusion model designed for streaming generation, which conditions on dynamically updated local history frames to synthesize arbitrary-length facial motion sequences with low latency. Our method integrates audio-conditioned diffusion with an autoregressive framework, enabling dynamic modeling of historical motion contextโ€”thereby overcoming fundamental limitations of conventional non-streaming models in temporal modeling and real-time inference. Experiments demonstrate that our model achieves high-fidelity expression details while maintaining constant inference latency independent of audio duration, significantly outperforming existing approaches. The system has been successfully deployed as a real-time interactive platform.

๐Ÿ“ Abstract
This paper focuses on the task of speech-driven 3D facial animation, which aims to generate realistic and synchronized facial motions driven by speech inputs. Recent methods have employed audio-conditioned diffusion models for 3D facial animation, achieving impressive results in generating expressive and natural animations. However, these methods process the entire audio sequence in a single pass, which poses two major challenges: they tend to perform poorly on audio sequences that exceed the training horizon, and they suffer from significant latency when processing long audio inputs. To address these limitations, we propose a novel autoregressive diffusion model that processes input audio in a streaming manner. This design ensures flexibility with varying audio lengths and achieves low latency independent of audio duration. Specifically, we select a limited number of past frames as historical motion context and combine them with the audio input to form a dynamic condition. This condition guides the diffusion process to iteratively generate facial motion frames, enabling real-time synthesis with high-quality results. Additionally, we implement a real-time interactive demo, highlighting the effectiveness and efficiency of our approach. We will release the code at https://zju3dv.github.io/StreamingTalker/.
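The generation loop described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: the denoiser is a stand-in function, and all names, dimensions, and constants (`HISTORY_LEN`, `MOTION_DIM`, `DIFFUSION_STEPS`, `denoise_step`, `stream_generate`) are hypothetical. It shows only the control flow: one motion frame is denoised per incoming audio feature, conditioned on a sliding window of previously generated frames, so per-frame latency stays constant regardless of total audio length.

```python
from collections import deque
import numpy as np

rng = np.random.default_rng(0)

HISTORY_LEN = 8      # hypothetical: number of past motion frames kept as context
MOTION_DIM = 64      # hypothetical: per-frame facial motion parameters
DIFFUSION_STEPS = 4  # hypothetical: denoising iterations per frame

def denoise_step(noisy_frame, audio_feat, history, t):
    """Stand-in for the learned denoiser: blends the noisy sample toward a
    condition built from the audio feature and the mean of the history frames."""
    context = history.mean(axis=0) if len(history) else np.zeros(MOTION_DIM)
    condition = 0.5 * audio_feat + 0.5 * context
    alpha = (t + 1) / DIFFUSION_STEPS
    return (1 - alpha) * noisy_frame + alpha * condition

def stream_generate(audio_features):
    """Autoregressively denoise one motion frame per incoming audio feature,
    conditioning on a sliding window of previously generated frames."""
    history = deque(maxlen=HISTORY_LEN)  # dynamically updated local history
    out = []
    for audio_feat in audio_features:    # audio arrives chunk by chunk
        frame = rng.standard_normal(MOTION_DIM)  # start from noise
        hist = np.array(history) if history else np.empty((0, MOTION_DIM))
        for t in range(DIFFUSION_STEPS):
            frame = denoise_step(frame, audio_feat, hist, t)
        history.append(frame)            # oldest frame is evicted automatically
        out.append(frame)
    return np.stack(out)

audio = rng.standard_normal((20, MOTION_DIM))  # 20 dummy audio feature frames
motion = stream_generate(audio)
print(motion.shape)  # (20, 64)
```

Because the history window has a fixed maximum length, the per-frame conditioning cost does not grow with sequence length, which is the property the paper exploits for constant-latency streaming.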
Problem

Research questions and friction points this paper is trying to address.

Generating real-time 3D facial animation from speech
Overcoming latency issues with long audio sequences
Ensuring flexibility for varying audio input lengths
Innovation

Methods, ideas, or system contributions that make the work stand out.

Autoregressive diffusion model for facial animation
Streaming audio processing for low latency
Dynamic historical context guides motion generation
Yifan Yang
State Key Laboratory of CAD&CG, Zhejiang University
Zhi Cen
Zhejiang University
Computer Vision
Sida Peng
Zhejiang University
Computer Vision, Computer Graphics
Xiangwei Chen
College of Computer Science, Zhejiang University
Yifu Deng
Ant Group
Xinyu Zhu
Ant Group
Fan Jia
Faculty of Chemistry and Biochemistry, Ruhr-University of Bochum
Organic Chemistry
Xiaowei Zhou
Professor of Computer Science, Zhejiang University
Computer Vision, Computer Graphics
Hujun Bao
State Key Laboratory of CAD&CG, Zhejiang University