AudioFace: Language-Assisted Speech-Driven Facial Animation with Multimodal Language Models

📅 2026-05-08

🤖 AI Summary
This work addresses the difficulty of modeling the correspondence between acoustic signals and lip movements in speech-driven facial animation, where mapping audio directly to facial coefficients tends to overlook the linguistic and phonemic structure of speech. The study brings language priors from multimodal large language models into the animation pipeline and formulates the prediction of mouth blendshape coefficients as a structured generation process guided by textual transcripts and phoneme-level articulatory cues. By combining speech-text alignment, phoneme feature extraction, and prior knowledge from large models, the approach aims to improve both the accuracy and the interpretability of the synthesized facial animation. The authors report that it outperforms existing methods across multiple evaluation metrics, which supports the value of linguistic structure and multimodal priors for this task.
📝 Abstract
Speech-driven facial animation requires accurate correspondence between acoustic signals and facial motion, especially for articulation-related mouth movements. However, directly mapping speech audio to facial coefficients often overlooks the linguistic and phonetic structure underlying speech production. In this paper, we propose AudioFace, a language-assisted framework for speech-driven blendshape generation that treats mouth-related facial coefficient prediction as a structured generation problem guided by linguistic and articulatory information. Instead of relying solely on acoustic features, our method leverages the prior knowledge of multimodal large language models and introduces transcript- and phoneme-level cues to bridge speech signals with interpretable facial actions. Extensive experiments show that AudioFace achieves superior performance across multiple evaluation metrics, validating the effectiveness of language-assisted and multimodal-prior-guided speech-driven facial animation.
Problem

Research questions and friction points this paper is trying to address.

speech-driven facial animation
linguistic structure
articulatory information
facial motion
audio-to-face correspondence
Innovation

Methods, ideas, or system contributions that make the work stand out.

language-assisted
multimodal language models
speech-driven facial animation
phoneme-level cues
structured generation