CapTalk: Text-Guided Stylization and Speech-Driven 3D Head Animation

📅 2026-05-27

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the limited controllability of speaking style and the inability to dynamically modulate facial expressions according to vocal emotion in existing speech-driven 3D facial animation methods. We propose a text-guided multimodal generative model that, for the first time, enables disentangled control over speaking style and emotion. By integrating acoustic signals with textual descriptions of both style and emotion, our approach supports real-time, independent adjustment of these attributes during inference. To facilitate this research, we introduce the first large-scale audio-visual dataset annotated with textual labels for diverse speaking styles and emotional states. The generated 3D talking heads achieve accurate lip synchronization while faithfully rendering user-specified stylistic and emotional dynamics, substantially enhancing both expressiveness and controllability of facial animation.

📝 Abstract

Audio-driven 3D facial animation aims to generate synchronized lip movements and vivid facial expressions from arbitrary audio clips. While existing methods can produce synchronized lip motions, they often rely on predefined identity or style latent features, which limits users' ability to freely control speaking styles. Moreover, applying a single fixed style or identity to an entire audio segment typically produces facial animation that does not adapt to the emotional content of the audio. To address these challenges, we revisit the entanglement between style and emotion, construct a large-scale dataset with textual descriptions of both style and emotion, and propose a novel talking head generation framework that enables separate control over style and emotion. Our model takes as input textual descriptions of speaking style and character emotion together with the driving audio stream, and generates, in real time, highly synchronized lip movements and facial expressions that match the provided descriptions. Furthermore, our model supports dynamic emotion control during inference, allowing it to handle scenarios where the target emotion changes over the course of the speech.
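To make the idea of separate style and emotion control concrete, the sketch below shows one way a caller might prepare per-frame conditioning when the target emotion changes during an utterance. It is an illustration only, not the paper's implementation: `FrameCondition`, `build_schedule`, the frame rate, and the prompt strings are all hypothetical, and the abstract does not specify how the model actually consumes its text inputs.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class FrameCondition:
    """Text conditioning for one animation frame (hypothetical structure)."""
    time_s: float
    style_text: str     # fixed for the whole clip
    emotion_text: str   # may change from segment to segment


def build_schedule(
    style_text: str,
    emotion_segments: List[Tuple[float, str]],
    duration_s: float,
    fps: int = 30,
) -> List[FrameCondition]:
    """Expand timed emotion prompts into one condition per frame.

    emotion_segments holds (start_time_s, emotion_text) pairs sorted by
    start time; the first pair must start at 0.0.
    """
    if not emotion_segments or emotion_segments[0][0] != 0.0:
        raise ValueError("the first emotion segment must start at 0.0")
    starts = [start for start, _ in emotion_segments]
    if starts != sorted(starts):
        raise ValueError("emotion segments must be sorted by start time")

    n_frames = int(round(duration_s * fps))
    schedule: List[FrameCondition] = []
    seg = 0
    for i in range(n_frames):
        t = i / fps
        # Advance to the latest segment whose start time has been reached.
        while seg + 1 < len(emotion_segments) and emotion_segments[seg + 1][0] <= t:
            seg += 1
        schedule.append(FrameCondition(t, style_text, emotion_segments[seg][1]))
    return schedule


# Assumed example prompts: one style description, two emotion segments.
schedule = build_schedule(
    style_text="calm, measured delivery",
    emotion_segments=[(0.0, "neutral"), (1.0, "joyful")],
    duration_s=2.0,
    fps=10,
)
```

In this sketch the style description is held constant across every frame while the emotion description switches at the segment boundary, which mirrors the independent, time-varying emotion control the abstract describes. How such a schedule would be encoded and fused with audio features is left open here.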

Problem

Research questions and friction points this paper is trying to address.

audio-driven animation

talking head

style control

emotion adaptation

facial animation

Innovation

Methods, ideas, or system contributions that make the work stand out.

text-guided stylization

speech-driven animation

style-emotion disentanglement