UniLS: End-to-End Audio-Driven Avatars for Unified Listening and Speaking

📅 2025-12-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the limited dynamism and rigidity of listener facial expressions generated by existing audio-driven virtual human methods, this paper proposes the first end-to-end, dual-track audio-driven framework for unified speaker–listener facial expression generation. Methodologically, the authors introduce a two-stage training paradigm: (1) unsupervised autoregressive modeling of facial motion priors without audio input, followed by (2) joint modulation using the speaker and listener audio streams. This enables unified facial expression synthesis from audio alone, without explicit motion annotations or auxiliary supervision. The key contribution is the first principled decoupling and co-modeling of speaker and listener dynamics, which substantially improves listening naturalness: the listening fidelity metric improves by 44.1%, while motion diversity and temporal coherence achieve state-of-the-art performance. Crucially, speaker expression accuracy remains competitive with leading approaches.

📝 Abstract
Generating lifelike conversational avatars requires modeling not just isolated speakers, but the dynamic, reciprocal interaction of speaking and listening. However, modeling the listener is exceptionally challenging: direct audio-driven training fails, producing stiff, static listening motions. This failure stems from a fundamental imbalance: the speaker's motion is strongly driven by speech audio, while the listener's motion primarily follows an internal motion prior and is only loosely guided by external speech. This challenge has led most methods to focus on speak-only generation. The only prior attempt at joint generation relies on the speaker's motion as an additional input to produce the listener; this design is not end-to-end, which hinders real-time applicability. To address this limitation, we present UniLS, the first end-to-end framework for generating unified speak-listen expressions, driven only by dual-track audio. Our method introduces a novel two-stage training paradigm. Stage 1 learns the internal motion prior by training an audio-free autoregressive generator, capturing the spontaneous dynamics of natural facial motion. Stage 2 then introduces the dual-track audio, fine-tuning the generator to modulate the learned motion prior based on external speech cues. Extensive evaluations show that UniLS achieves state-of-the-art speaking accuracy. More importantly, it delivers up to a 44.1% improvement in listening metrics, generating significantly more diverse and natural listening expressions. This effectively mitigates the stiffness problem and provides a practical, high-fidelity audio-driven solution for interactive digital humans.
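As a rough illustration of the two-stage paradigm, the sketch below trains a small autoregressive motion generator first without audio (Stage 1) and then fine-tunes it with concatenated speaker and listener audio features (Stage 2). The GRU backbone, the feature dimensions, the additive audio modulation, and the teacher-forced MSE loss are all assumptions made for this sketch; the paper's actual architecture and losses may differ.

```python
# Minimal sketch of the two-stage training paradigm (assumed architecture, not the
# authors' actual model): Stage 1 learns an audio-free autoregressive motion prior,
# Stage 2 fine-tunes the same generator with dual-track (speaker + listener) audio.
import torch
import torch.nn as nn

MOTION_DIM = 64   # assumed size of per-frame facial expression coefficients
AUDIO_DIM = 128   # assumed size of per-frame audio features for one track
HIDDEN = 256

class MotionPriorGenerator(nn.Module):
    """Autoregressive generator over facial motion frames.

    Stage 1 trains it with audio=None; Stage 2 fine-tunes it with dual-track
    audio features that modulate the learned motion prior.
    """
    def __init__(self):
        super().__init__()
        self.motion_in = nn.Linear(MOTION_DIM, HIDDEN)
        self.audio_in = nn.Linear(2 * AUDIO_DIM, HIDDEN)  # speaker + listener tracks
        self.rnn = nn.GRU(HIDDEN, HIDDEN, batch_first=True)
        self.motion_out = nn.Linear(HIDDEN, MOTION_DIM)

    def forward(self, motion, audio=None):
        # motion: (B, T, MOTION_DIM); audio: (B, T, 2 * AUDIO_DIM) or None
        x = self.motion_in(motion)
        if audio is not None:
            x = x + self.audio_in(audio)   # simple additive modulation (assumption)
        h, _ = self.rnn(x)
        return self.motion_out(h)          # prediction of the next motion frame

def train_step(model, optimizer, motion, audio=None):
    # Teacher-forced next-frame prediction: predict frame t+1 from frames up to t.
    pred = model(motion[:, :-1], None if audio is None else audio[:, :-1])
    loss = nn.functional.mse_loss(pred, motion[:, 1:])
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

if __name__ == "__main__":
    model = MotionPriorGenerator()
    opt = torch.optim.Adam(model.parameters(), lr=1e-4)
    motion = torch.randn(4, 100, MOTION_DIM)          # dummy motion clips
    dual_audio = torch.randn(4, 100, 2 * AUDIO_DIM)   # dummy speaker + listener features

    # Stage 1: learn the internal motion prior without any audio input.
    train_step(model, opt, motion)
    # Stage 2: fine-tune with dual-track audio modulating the learned prior.
    train_step(model, opt, motion, dual_audio)
```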
Problem

Research questions and friction points this paper is trying to address.

Generates unified speaking and listening expressions from dual-track audio
Overcomes stiffness in listener motions by learning internal motion prior
Enables real-time interactive avatars without extra speaker motion input
Innovation

Methods, ideas, or system contributions that make the work stand out.

Two-stage training with audio-free autoregressive generator
Dual-track audio-driven modulation of learned motion prior
End-to-end framework for unified speak-listen expressions (see the inference sketch below)
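For completeness, the sketch below shows what end-to-end inference could look like for a model with the same assumed forward(motion, audio) interface as the training sketch above: dual-track audio alone drives an autoregressive rollout of facial motion frames, with no speaker-motion input required. The neutral start frame and the frame-by-frame rollout are illustrative assumptions, not details taken from the paper.

```python
# Hypothetical end-to-end inference: dual-track audio alone drives autoregressive
# generation of unified speaker/listener facial motion (no speaker motion as input).
import torch

@torch.no_grad()
def generate(model, dual_audio, motion_dim=64):
    # dual_audio: (1, T, 2 * AUDIO_DIM) -- per-frame speaker + listener audio features
    T = dual_audio.shape[1]
    frames = [torch.zeros(1, 1, motion_dim)]           # assumed neutral start frame
    for t in range(T - 1):
        history = torch.cat(frames, dim=1)             # all frames generated so far
        pred = model(history, dual_audio[:, : t + 1])  # condition on audio up to frame t
        frames.append(pred[:, -1:])                    # keep only the newest frame
    return torch.cat(frames, dim=1)                    # (1, T, motion_dim)
```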
Xuangeng Chu
The University of Tokyo
3D Computer Vision · Virtual Humans · Digital Humans
Ruicong Liu
The University of Tokyo
Computer Vision
Yifei Huang
Shanda AI Research Tokyo, The University of Tokyo
Yun Liu
Shanda AI Research Tokyo, The University of Tokyo
Yichen Peng
Tokyo Institute of Technology
Computer Graphics · HCI · Machine Learning
Bo Zheng
Shanda AI Research Tokyo, The University of Tokyo