Think-Before-Draw: Decomposing Emotion Semantics & Fine-Grained Controllable Expressive Talking Head Generation

📅 2025-07-16
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing text-driven emotional talking-head methods rely on discrete emotion labels, failing to capture the dynamic continuity of real facial muscle movements and resulting in stiff expressions. To address this, we propose a fine-grained controllable emotional talking-head generation framework. First, a multimodal large language model parses textual emotional semantics and, via chain-of-thought reasoning, maps them to physiologically grounded facial muscle action descriptions. Second, we design a progressive hierarchical diffusion denoising mechanism that refines modeling from global emotion localization to local muscle dynamics. Our method achieves state-of-the-art performance on the MEAD and HDTF benchmarks, exhibits strong zero-shot generalization, and accurately reconstructs diverse micro-expression dynamics. It significantly enhances expression naturalness and expressiveness while preserving temporal coherence and anatomical plausibility.
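
The summary's first stage, expanding an abstract emotion label into muscle-level action descriptions, can be sketched as a simple lookup in FACS-style action-unit (AU) terms. This is an illustrative toy, assuming a hand-written AU table; the paper's actual mapping is produced by a multimodal LLM with chain-of-thought prompting, not a fixed dictionary:

```python
# Illustrative sketch only: a fixed FACS-style table standing in for the
# paper's LLM-based chain-of-thought mapping. The AU entries are common
# textbook associations, not the paper's actual outputs.

EMOTION_TO_AUS = {
    "happy": ["AU6 cheek raiser", "AU12 lip corner puller"],
    "sad": ["AU1 inner brow raiser", "AU4 brow lowerer", "AU15 lip corner depressor"],
    "angry": ["AU4 brow lowerer", "AU7 lid tightener", "AU23 lip tightener"],
}

def emotion_to_muscle_description(label: str) -> str:
    """Expand an abstract emotion label into a muscle-level description."""
    aus = EMOTION_TO_AUS.get(label.lower())
    if aus is None:
        raise ValueError(f"unknown emotion label: {label}")
    return f"{label}: " + "; ".join(aus)

print(emotion_to_muscle_description("happy"))
# → happy: AU6 cheek raiser; AU12 lip corner puller
```

In the paper, the key point is that such descriptions are dynamic and continuous rather than drawn from a closed label set, which is exactly what a static table cannot provide.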

📝 Abstract
Emotional talking-head generation has emerged as a pivotal research area at the intersection of computer vision and multimodal artificial intelligence, with its core value lying in enhancing human-computer interaction through immersive and empathetic engagement. With the advancement of multimodal large language models, the driving signals for emotional talking-head generation have shifted from audio and video to more flexible text. However, current text-driven methods rely on predefined discrete emotion label texts, oversimplifying the dynamic complexity of real facial muscle movements and thus failing to achieve natural emotional expressiveness. This study proposes the Think-Before-Draw framework to address two key challenges: (1) in-depth semantic parsing of emotions: by innovatively introducing Chain-of-Thought (CoT) reasoning, abstract emotion labels are transformed into physiologically grounded facial muscle movement descriptions, enabling the mapping from high-level semantics to actionable motion features; and (2) fine-grained expressiveness optimization: inspired by artists' portrait-painting process, a progressive guidance denoising strategy is proposed, employing a "global emotion localization to local muscle control" mechanism to refine micro-expression dynamics in generated videos. Our experiments demonstrate that our approach achieves state-of-the-art performance on widely used benchmarks, including MEAD and HDTF. Additionally, we collected a set of portrait images to evaluate our model's zero-shot generation capability.
Problem

Research questions and friction points this paper is trying to address.

Decomposing emotion semantics into facial muscle movements
Achieving fine-grained control over expressive talking heads
Enhancing natural emotional expressiveness in generated videos
Innovation

Methods, ideas, or system contributions that make the work stand out.

Chain-of-Thought transforms emotions to muscle movements
Progressive guidance denoising refines micro-expression dynamics
Global-local mechanism controls facial muscle actions
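
The progressive "global emotion localization to local muscle control" idea in the bullets above can be illustrated as a denoising-time guidance schedule: early (noisy) steps weight a global emotion condition, later steps shift weight toward local muscle-action conditions. The cosine ramp below is an assumption for illustration, not the paper's actual schedule:

```python
import math

def guidance_weights(t: float) -> tuple[float, float]:
    """Toy global-to-local guidance schedule for diffusion denoising.

    t runs from 1.0 (start of denoising, pure noise) down to 0.0 (clean
    frame). Returns (global_weight, local_weight). The cosine ramp is a
    hypothetical choice standing in for the paper's progressive strategy.
    """
    local = 0.5 * (1.0 + math.cos(math.pi * t))  # 0 at t=1.0, 1 at t=0.0
    return 1.0 - local, local

for t in (1.0, 0.5, 0.0):
    g, l = guidance_weights(t)
    print(f"t={t:.1f}  global={g:.2f}  local={l:.2f}")
# → t=1.0  global=1.00  local=0.00
# → t=0.5  global=0.50  local=0.50
# → t=0.0  global=0.00  local=1.00
```

The design intent mirrored here is that coarse emotion placement is resolved while the sample is still noisy, leaving fine muscle dynamics to the low-noise steps where local detail emerges.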