Think-Before-Draw: Decomposing Emotion Semantics & Fine-Grained Controllable Expressive Talking Head Generation

📅 2025-07-16
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing text-driven emotional talking-head methods rely on discrete emotion labels, failing to capture the dynamic continuity of real facial muscle movements and resulting in stiff expressions. To address this, we propose a fine-grained controllable emotional talking-head generation framework. First, a multimodal large language model parses textual emotional semantics and, via chain-of-thought reasoning, maps them to physiologically grounded facial muscle action descriptions. Second, we design a progressive hierarchical diffusion denoising mechanism that refines modeling from global emotion localization to local muscle dynamics. Our method achieves state-of-the-art performance on the MEAD and HDTF benchmarks, exhibits strong zero-shot generalization, and accurately reconstructs diverse micro-expression dynamics. It significantly enhances expression naturalness and expressiveness while preserving temporal coherence and anatomical plausibility.
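
The summary's first stage, expanding an abstract emotion label into muscle-level action descriptions, can be sketched as a simple lookup in FACS-style action-unit (AU) terms. This is an illustrative toy, assuming a hand-written AU table; the paper's actual mapping is produced by a multimodal LLM with chain-of-thought prompting, not a fixed dictionary:

```python
# Illustrative sketch only: a fixed FACS-style table standing in for the
# paper's LLM-based chain-of-thought mapping. The AU entries are common
# textbook associations, not the paper's actual outputs.

EMOTION_TO_AUS = {
    "happy": ["AU6 cheek raiser", "AU12 lip corner puller"],
    "sad": ["AU1 inner brow raiser", "AU4 brow lowerer", "AU15 lip corner depressor"],
    "angry": ["AU4 brow lowerer", "AU7 lid tightener", "AU23 lip tightener"],
}

def emotion_to_muscle_description(label: str) -> str:
    """Expand an abstract emotion label into a muscle-level description."""
    aus = EMOTION_TO_AUS.get(label.lower())
    if aus is None:
        raise ValueError(f"unknown emotion label: {label}")
    return f"{label}: " + "; ".join(aus)

print(emotion_to_muscle_description("happy"))
# → happy: AU6 cheek raiser; AU12 lip corner puller
```

In the paper, the key point is that such descriptions are dynamic and continuous rather than drawn from a closed label set, which is exactly what a static table cannot provide.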

📝 Abstract
Emotional talking-head generation has emerged as a pivotal research area at the intersection of computer vision and multimodal artificial intelligence, with its core value lying in enhancing human-computer interaction through immersive and empathetic engagement. With the advancement of multimodal large language models, the driving signals for emotional talking-head generation have shifted from audio and video to more flexible text. However, current text-driven methods rely on predefined discrete emotion label texts, oversimplifying the dynamic complexity of real facial muscle movements and thus failing to achieve natural emotional expressiveness. This study proposes the Think-Before-Draw framework to address two key challenges: (1) in-depth semantic parsing of emotions: by innovatively introducing Chain-of-Thought (CoT) reasoning, abstract emotion labels are transformed into physiologically grounded facial muscle movement descriptions, enabling the mapping from high-level semantics to actionable motion features; and (2) fine-grained expressiveness optimization: inspired by artists' portrait-painting process, a progressive guidance denoising strategy is proposed, employing a "global emotion localization to local muscle control" mechanism to refine micro-expression dynamics in generated videos. Our experiments demonstrate that our approach achieves state-of-the-art performance on widely used benchmarks, including MEAD and HDTF. Additionally, we collected a set of portrait images to evaluate our model's zero-shot generation capability.
Problem

Research questions and friction points this paper is trying to address.

Decomposing emotion semantics into facial muscle movements
Achieving fine-grained control over expressive talking heads
Enhancing natural emotional expressiveness in generated videos
Innovation

Methods, ideas, or system contributions that make the work stand out.

Chain-of-Thought transforms emotions to muscle movements
Progressive guidance denoising refines micro-expression dynamics
Global-local mechanism controls facial muscle actions
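
The progressive "global emotion localization to local muscle control" idea in the bullets above can be illustrated as a denoising-time guidance schedule: early (noisy) steps weight a global emotion condition, later steps shift weight toward local muscle-action conditions. The cosine ramp below is an assumption for illustration, not the paper's actual schedule:

```python
import math

def guidance_weights(t: float) -> tuple[float, float]:
    """Toy global-to-local guidance schedule for diffusion denoising.

    t runs from 1.0 (start of denoising, pure noise) down to 0.0 (clean
    frame). Returns (global_weight, local_weight). The cosine ramp is a
    hypothetical choice standing in for the paper's progressive strategy.
    """
    local = 0.5 * (1.0 + math.cos(math.pi * t))  # 0 at t=1.0, 1 at t=0.0
    return 1.0 - local, local

for t in (1.0, 0.5, 0.0):
    g, l = guidance_weights(t)
    print(f"t={t:.1f}  global={g:.2f}  local={l:.2f}")
# → t=1.0  global=1.00  local=0.00
# → t=0.5  global=0.50  local=0.50
# → t=0.0  global=0.00  local=1.00
```

The design intent mirrored here is that coarse emotion placement is resolved while the sample is still noisy, leaving fine muscle dynamics to the low-noise steps where local detail emerges.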