AI Summary
This work addresses the challenge of accurately modeling the correspondence between acoustic signals and lip movements in speech-driven facial animation, a task often hindered by insufficient exploitation of linguistic and phonemic structures underlying speech. To this end, the study introduces language priors from multimodal large language models into the animation pipeline for the first time, formulating the prediction of mouth blendshape coefficients as a structured generation process guided by both textual transcripts and phoneme-level articulatory cues. By integrating speech-text alignment, phoneme feature extraction, and prior knowledge from large models, the proposed approach substantially enhances both the accuracy and interpretability of synthesized facial animations. Experimental results demonstrate clear superiority over existing methods across multiple evaluation metrics, underscoring the critical role of linguistic structure and multimodal priors in speech-driven facial animation.
Abstract
Speech-driven facial animation requires accurate correspondence between acoustic signals and facial motion, especially for articulation-related mouth movements. However, directly mapping speech audio to facial coefficients often overlooks the linguistic and phonetic structure underlying speech production. In this paper, we propose AudioFace, a language-assisted framework for speech-driven blendshape generation that treats mouth-related facial coefficient prediction as a structured generation problem guided by linguistic and articulatory information. Instead of relying solely on acoustic features, our method leverages the prior knowledge of multimodal large language models and introduces transcript- and phoneme-level cues to bridge speech signals with interpretable facial actions. Extensive experiments show that AudioFace achieves superior performance across multiple evaluation metrics, validating the effectiveness of language-assisted and multimodal-prior-guided speech-driven facial animation.
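To make the idea of phoneme-level cues concrete, the following is a minimal sketch of one way time-aligned phonemes could be turned into per-frame labels that a blendshape predictor might consume. The abstract does not describe AudioFace's actual alignment or feature pipeline, so everything here is an assumption for illustration: the function name `phonemes_to_frames`, the `(start_sec, end_sec, phoneme)` interval format, the frame rate, and the `"sil"` silence label are all hypothetical.

```python
from typing import List, Tuple

# Hypothetical input format: (start_sec, end_sec, phoneme), e.g. the kind of
# output a forced aligner produces. Not taken from the paper.
PhonemeInterval = Tuple[float, float, str]


def phonemes_to_frames(
    intervals: List[PhonemeInterval],
    num_frames: int,
    fps: float = 30.0,
    silence: str = "sil",
) -> List[str]:
    """Assign one phoneme label to each animation frame.

    A frame receives the phoneme whose interval contains the frame's
    center time; frames covered by no interval are labeled as silence.
    """
    labels = []
    for i in range(num_frames):
        center = (i + 0.5) / fps  # center time of frame i, in seconds
        label = silence
        for start, end, phoneme in intervals:
            if start <= center < end:
                label = phoneme
                break
        labels.append(label)
    return labels


if __name__ == "__main__":
    # Two phonemes of 0.1 s each, sampled at 10 frames per second.
    example = [(0.0, 0.1, "HH"), (0.1, 0.2, "AH")]
    print(phonemes_to_frames(example, num_frames=4, fps=10.0))
```

Per-frame labels of this kind are one plausible way to pair each frame of mouth blendshape coefficients with an articulatory cue; the paper may use a different representation entirely.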