Wav2Sem: Plug-and-Play Audio Semantic Decoupling for 3D Speech-Driven Facial Animation

📅 2025-05-29

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

In speech-driven 3D facial animation, phonetically similar syllables become coupled in self-supervised audio feature spaces, which blurs lip articulation. To address this, we propose Wav2Sem, a plug-and-play audio semantic disentanglement module. Built on pre-trained audio models (e.g., Wav2Vec 2.0), Wav2Sem extracts temporal audio features and applies semantic-guided contrastive learning together with feature orthogonality constraints, which disentangles syllables that are acoustically similar yet produce distinct lip movements. Wav2Sem requires no modification or retraining of the underlying animation framework, so it integrates with little overhead and stays compatible across architectures. Evaluated on multiple state-of-the-art speech-driven frameworks, Wav2Sem reduces lip landmark error by 18.7% and improves subjective naturalness by 22%, significantly mitigating the lip-averaging effect.
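To make the idea of a feature orthogonality constraint concrete, here is a minimal NumPy sketch of one common way to penalize coupling between embeddings: the mean squared off-diagonal cosine similarity. This is an illustrative assumption, not the loss used in the paper; the function name `offdiag_cosine_penalty` and the toy vectors are hypothetical.

```python
import numpy as np


def offdiag_cosine_penalty(features: np.ndarray) -> float:
    """Mean squared cosine similarity between distinct rows of `features`.

    features: array of shape (n, d) with n >= 2, one row per syllable embedding.
    The value is 0.0 when all rows are mutually orthogonal and approaches 1.0
    when all rows point in (nearly) the same direction.
    NOTE: an assumed, generic penalty for illustration only.
    """
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    unit = features / np.clip(norms, 1e-8, None)   # row-normalize, guard against zero rows
    gram = unit @ unit.T                           # pairwise cosine similarities
    n = gram.shape[0]
    off_diag = gram[~np.eye(n, dtype=bool)]        # drop self-similarities
    return float(np.mean(off_diag ** 2))


# Two toy embeddings that almost coincide (coupled) vs. two orthogonal ones.
coupled = np.array([[1.0, 0.0], [0.99, 0.14]])
decoupled = np.array([[1.0, 0.0], [0.0, 1.0]])

penalty_coupled = offdiag_cosine_penalty(coupled)
penalty_decoupled = offdiag_cosine_penalty(decoupled)
```

Under this toy measure, the coupled pair receives a penalty close to 1 and the orthogonal pair receives 0, so minimizing it during training would push near-identical embeddings apart.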

📝 Abstract

In 3D speech-driven facial animation generation, existing methods commonly employ pre-trained self-supervised audio models as encoders. However, because language contains many phonetically similar syllables with distinct lip shapes, these near-homophone syllables tend to exhibit significant coupling in self-supervised audio feature spaces, leading to an averaging effect in subsequent lip motion generation. To address this issue, this paper proposes a plug-and-play semantic decorrelation module, Wav2Sem. The module extracts semantic features corresponding to the entire audio sequence and uses this added semantic information to decorrelate audio encodings within the feature space, yielding more expressive audio features. Extensive experiments across multiple speech-driven models indicate that the Wav2Sem module effectively decouples audio features and significantly alleviates the averaging effect of phonetically similar syllables in lip shape generation, thereby enhancing the precision and naturalness of facial animations. Our source code is available at https://github.com/wslh852/Wav2Sem.git.
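The abstract's core idea, adding sequence-level information so that identical frame encodings stop colliding, can be sketched in a few lines of NumPy. This is a toy illustration under stated assumptions: mean pooling stands in for the learned semantic feature, additive fusion stands in for whatever fusion the module actually uses, and the name `add_sequence_context` is hypothetical. The output keeps the input shape, which is what lets a downstream animation model consume it unchanged.

```python
import numpy as np


def add_sequence_context(frame_feats: np.ndarray, weight: float = 1.0) -> np.ndarray:
    """Add one sequence-level vector to every frame feature.

    frame_feats: array of shape (T, d), one row per audio frame.
    Returns an array of the same shape (T, d).
    ASSUMPTION: mean pooling is only a stand-in for a learned semantic feature.
    """
    seq_vec = frame_feats.mean(axis=0, keepdims=True)  # (1, d) sequence summary
    return frame_feats + weight * seq_vec              # broadcast over all T frames


# The first frame is identical in both utterances; only the context differs.
shared = np.array([1.0, 0.0, 0.0])
utt_a = np.stack([shared, np.array([0.0, 1.0, 0.0])])
utt_b = np.stack([shared, np.array([0.0, 0.0, 1.0])])

fused_a = add_sequence_context(utt_a)
fused_b = add_sequence_context(utt_b)
```

Before fusion the first frames of the two utterances are indistinguishable; after fusion they differ because each carries its own sequence context, while the `(T, d)` shape expected by the downstream model is preserved.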

Problem

Research questions and friction points this paper is trying to address.

Decoupling phonetically similar syllables in audio features

Reducing averaging effect in lip motion generation

Enhancing precision and naturalness of facial animations

Innovation

Methods, ideas, or system contributions that make the work stand out.

Plug-and-play semantic decorrelation module

Decouples audio features for expressive animations

Enhances precision and naturalness of lip shapes
