UniLS: End-to-End Audio-Driven Avatars for Unified Listening and Speaking

📅 2025-12-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the limited dynamism and rigidity of listener facial expressions generated by existing audio-driven virtual human methods, this paper proposes the first end-to-end, dual-track audio-driven framework for unified speaker–listener facial expression generation. Methodologically, the authors introduce a two-stage training paradigm: (1) unsupervised autoregressive modeling of facial motion priors without audio input, followed by (2) joint modulation using the speaker and listener audio streams. This enables unified facial expression synthesis from audio alone, without explicit motion annotations or auxiliary supervision. The key contribution is the first principled decoupling and co-modeling of speaker and listener dynamics, which substantially improves listening naturalness: the listening fidelity metric improves by 44.1%, while motion diversity and temporal coherence achieve state-of-the-art performance. Crucially, speaker expression accuracy remains competitive with leading approaches.

📝 Abstract
Generating lifelike conversational avatars requires modeling not just isolated speakers, but the dynamic, reciprocal interaction of speaking and listening. However, modeling the listener is exceptionally challenging: direct audio-driven training fails, producing stiff, static listening motions. This failure stems from a fundamental imbalance: the speaker's motion is strongly driven by speech audio, while the listener's motion primarily follows an internal motion prior and is only loosely guided by external speech. This challenge has led most methods to focus on speak-only generation. The only prior attempt at joint generation relies on the speaker's motion as an additional input to produce the listener; this design is not end-to-end, which hinders real-time applicability. To address this limitation, we present UniLS, the first end-to-end framework for generating unified speak-listen expressions, driven only by dual-track audio. Our method introduces a novel two-stage training paradigm. Stage 1 learns the internal motion prior by training an audio-free autoregressive generator, capturing the spontaneous dynamics of natural facial motion. Stage 2 then introduces the dual-track audio, fine-tuning the generator to modulate the learned motion prior based on external speech cues. Extensive evaluations show that UniLS achieves state-of-the-art speaking accuracy. More importantly, it delivers up to a 44.1% improvement in listening metrics, generating significantly more diverse and natural listening expressions. This effectively mitigates the stiffness problem and provides a practical, high-fidelity audio-driven solution for interactive digital humans.
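As a rough illustration of the two-stage paradigm, the sketch below trains a small autoregressive motion generator first without audio (Stage 1) and then fine-tunes it with concatenated speaker and listener audio features (Stage 2). The GRU backbone, the feature dimensions, the additive audio modulation, and the teacher-forced MSE loss are all assumptions made for this sketch; the paper's actual architecture and losses may differ.

```python
# Minimal sketch of the two-stage training paradigm (assumed architecture, not the
# authors' actual model): Stage 1 learns an audio-free autoregressive motion prior,
# Stage 2 fine-tunes the same generator with dual-track (speaker + listener) audio.
import torch
import torch.nn as nn

MOTION_DIM = 64   # assumed size of per-frame facial expression coefficients
AUDIO_DIM = 128   # assumed size of per-frame audio features for one track
HIDDEN = 256

class MotionPriorGenerator(nn.Module):
    """Autoregressive generator over facial motion frames.

    Stage 1 trains it with audio=None; Stage 2 fine-tunes it with dual-track
    audio features that modulate the learned motion prior.
    """
    def __init__(self):
        super().__init__()
        self.motion_in = nn.Linear(MOTION_DIM, HIDDEN)
        self.audio_in = nn.Linear(2 * AUDIO_DIM, HIDDEN)  # speaker + listener tracks
        self.rnn = nn.GRU(HIDDEN, HIDDEN, batch_first=True)
        self.motion_out = nn.Linear(HIDDEN, MOTION_DIM)

    def forward(self, motion, audio=None):
        # motion: (B, T, MOTION_DIM); audio: (B, T, 2 * AUDIO_DIM) or None
        x = self.motion_in(motion)
        if audio is not None:
            x = x + self.audio_in(audio)   # simple additive modulation (assumption)
        h, _ = self.rnn(x)
        return self.motion_out(h)          # prediction of the next motion frame

def train_step(model, optimizer, motion, audio=None):
    # Teacher-forced next-frame prediction: predict frame t+1 from frames up to t.
    pred = model(motion[:, :-1], None if audio is None else audio[:, :-1])
    loss = nn.functional.mse_loss(pred, motion[:, 1:])
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

if __name__ == "__main__":
    model = MotionPriorGenerator()
    opt = torch.optim.Adam(model.parameters(), lr=1e-4)
    motion = torch.randn(4, 100, MOTION_DIM)          # dummy motion clips
    dual_audio = torch.randn(4, 100, 2 * AUDIO_DIM)   # dummy speaker + listener features

    # Stage 1: learn the internal motion prior without any audio input.
    train_step(model, opt, motion)
    # Stage 2: fine-tune with dual-track audio modulating the learned prior.
    train_step(model, opt, motion, dual_audio)
```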
Problem

Research questions and friction points this paper is trying to address.

Generates unified speaking and listening expressions from dual-track audio
Overcomes stiffness in listener motions by learning internal motion prior
Enables real-time interactive avatars without extra speaker motion input
Innovation

Methods, ideas, or system contributions that make the work stand out.

Two-stage training with audio-free autoregressive generator
Dual-track audio-driven modulation of learned motion prior
End-to-end framework for unified speak-listen expressions (see the inference sketch below)
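For completeness, the sketch below shows what end-to-end inference could look like for a model with the same assumed forward(motion, audio) interface as the training sketch above: dual-track audio alone drives an autoregressive rollout of facial motion frames, with no speaker-motion input required. The neutral start frame and the frame-by-frame rollout are illustrative assumptions, not details taken from the paper.

```python
# Hypothetical end-to-end inference: dual-track audio alone drives autoregressive
# generation of unified speaker/listener facial motion (no speaker motion as input).
import torch

@torch.no_grad()
def generate(model, dual_audio, motion_dim=64):
    # dual_audio: (1, T, 2 * AUDIO_DIM) -- per-frame speaker + listener audio features
    T = dual_audio.shape[1]
    frames = [torch.zeros(1, 1, motion_dim)]           # assumed neutral start frame
    for t in range(T - 1):
        history = torch.cat(frames, dim=1)             # all frames generated so far
        pred = model(history, dual_audio[:, : t + 1])  # condition on audio up to frame t
        frames.append(pred[:, -1:])                    # keep only the newest frame
    return torch.cat(frames, dim=1)                    # (1, T, motion_dim)
```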
Xuangeng Chu
The University of Tokyo
3D Computer Vision · Virtual Humans · Digital Humans
Ruicong Liu
The University of Tokyo
Computer Vision
Yifei Huang
Shanda AI Research Tokyo, The University of Tokyo
Yun Liu
Shanda AI Research Tokyo, The University of Tokyo
Yichen Peng
Tokyo Institute of Technology
Computer Graphics · HCI · Machine Learning
Bo Zheng
Shanda AI Research Tokyo, The University of Tokyo