Disentangle Identity, Cooperate Emotion: Correlation-Aware Emotional Talking Portrait Generation

📅 2025-04-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing talking-head generation methods struggle to achieve lip-sync accuracy, visual fidelity, and natural emotional expression simultaneously, particularly while preserving speaker identity. To address this, the paper proposes DICE-Talk, an identity-emotion disentanglement framework with three components: (1) a speaker-agnostic Gaussian emotion embedding; (2) a learnable Emotion Bank coupled with a correlation-enhanced conditioning module; and (3) a latent-space emotion discrimination objective. Together these enable unified emotion-representation disentanglement and cross-emotion collaborative modeling. The method builds on diffusion models, cross-modal attention, and vector quantization. On the MEAD and HDTF benchmarks it achieves state-of-the-art emotion classification accuracy while maintaining competitive lip-sync performance (SyncNet-based LSE scores). User studies confirm that the generated videos preserve speaker identity while exhibiting rich, natural emotional expressiveness.
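
To make the Gaussian emotion embedding concrete, here is a minimal PyTorch sketch of how an emotion code could be represented as an identity-agnostic distribution and sampled with the reparameterization trick. The module name, dimensions, and KL regularizer are illustrative assumptions, not the paper's actual architecture.

```python
# Hypothetical sketch of a speaker-agnostic Gaussian emotion embedding:
# fused audio-visual features are mapped to a mean and log-variance, and the
# emotion code is drawn via the reparameterization trick, so conditioning is
# stochastic rather than a deterministic (identity-bearing) feature vector.
# All names and dimensions below are illustrative assumptions.
import torch
import torch.nn as nn


class GaussianEmotionEmbedder(nn.Module):
    def __init__(self, feat_dim: int = 512, emo_dim: int = 128):
        super().__init__()
        self.to_mu = nn.Linear(feat_dim, emo_dim)      # mean of the emotion distribution
        self.to_logvar = nn.Linear(feat_dim, emo_dim)  # log-variance, for numerical stability

    def forward(self, fused_feat: torch.Tensor):
        mu = self.to_mu(fused_feat)
        logvar = self.to_logvar(fused_feat)
        std = torch.exp(0.5 * logvar)
        eps = torch.randn_like(std)       # reparameterization trick
        emo_code = mu + eps * std         # sampled, identity-agnostic emotion code
        # A KL term toward N(0, I) would regularize the distribution (assumed here):
        kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
        return emo_code, kl
```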

📝 Abstract
Recent advances in Talking Head Generation (THG) have achieved impressive lip synchronization and visual quality through diffusion models; yet existing methods struggle to generate emotionally expressive portraits while preserving speaker identity. We identify three critical limitations in current emotional talking head generation: insufficient utilization of audio's inherent emotional cues, identity leakage in emotion representations, and isolated learning of emotion correlations. To address these challenges, we propose a novel framework, dubbed DICE-Talk, following the idea of disentangling identity from emotion and then cooperating emotions with similar characteristics. First, we develop a disentangled emotion embedder that jointly models audio-visual emotional cues through cross-modal attention, representing emotions as identity-agnostic Gaussian distributions. Second, we introduce a correlation-enhanced emotion conditioning module with learnable Emotion Banks that explicitly capture inter-emotion relationships through vector quantization and attention-based feature aggregation. Third, we design an emotion discrimination objective that enforces affective consistency during the diffusion process through latent-space classification. Extensive experiments on MEAD and HDTF datasets demonstrate our method's superiority, outperforming state-of-the-art approaches in emotion accuracy while maintaining competitive lip-sync performance. Qualitative results and user studies further confirm our method's ability to generate identity-preserving portraits with rich, correlated emotional expressions that naturally adapt to unseen identities.
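
As a rough illustration of the correlation-enhanced conditioning described above, the sketch below pairs a learnable Emotion Bank with nearest-neighbor vector quantization (straight-through gradients) and attention-based aggregation over the bank entries. Bank size, dimensions, and the exact attention form are assumptions; the paper's module may differ.

```python
# Hypothetical sketch of a learnable Emotion Bank: an emotion code is quantized
# to its nearest bank entry (gradients pass through via the straight-through
# estimator), and attention over the whole bank mixes in related emotions so
# inter-emotion correlations are modeled rather than learned in isolation.
# Commitment/codebook losses are omitted for brevity; all sizes are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class EmotionBank(nn.Module):
    def __init__(self, num_entries: int = 8, emo_dim: int = 128):
        super().__init__()
        self.bank = nn.Parameter(torch.randn(num_entries, emo_dim))

    def forward(self, emo_code: torch.Tensor) -> torch.Tensor:
        # Nearest-neighbor vector quantization against the bank entries.
        dists = torch.cdist(emo_code, self.bank)              # (B, num_entries)
        idx = dists.argmin(dim=-1)                            # (B,)
        quantized = self.bank[idx]                            # (B, emo_dim)
        # Straight-through estimator: gradients bypass the argmin.
        quantized = emo_code + (quantized - emo_code).detach()
        # Scaled-dot-product attention over the bank aggregates
        # features from related (correlated) emotions.
        scale = emo_code.shape[-1] ** 0.5
        attn = F.softmax(emo_code @ self.bank.t() / scale, dim=-1)
        correlated = attn @ self.bank                         # (B, emo_dim)
        # Combine the discrete code with its correlated neighborhood.
        return quantized + correlated
```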
Problem

Research questions and friction points this paper is trying to address.

Generating emotionally expressive portraits while preserving speaker identity
Exploiting audio's inherent emotional cues while preventing identity leakage into emotion representations
Modeling inter-emotion relationships to produce correlated emotional expressions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Disentangled emotion embedder with cross-modal attention
Correlation-enhanced emotion conditioning with Emotion Banks
Emotion discrimination objective for affective consistency (see the sketch after this list)
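
A minimal sketch of what the latent-space emotion discrimination objective could look like: a small classifier reads the denoised diffusion latent, and a cross-entropy loss against the target emotion label pushes the generator to keep affect consistent through the diffusion process. The classifier architecture and latent shape are assumptions, not the paper's actual design.

```python
# Hypothetical sketch of a latent-space emotion discrimination objective:
# a lightweight CNN classifies the denoised diffusion latent, and the
# cross-entropy against the target emotion label serves as an auxiliary
# training loss enforcing affective consistency. Shapes are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class LatentEmotionClassifier(nn.Module):
    def __init__(self, latent_channels: int = 4, num_emotions: int = 8):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(latent_channels, 64, 3, stride=2, padding=1),
            nn.SiLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1),
            nn.SiLU(),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(128, num_emotions),
        )

    def forward(self, latent: torch.Tensor) -> torch.Tensor:
        return self.net(latent)  # emotion logits


def emotion_discrimination_loss(
    classifier: LatentEmotionClassifier,
    denoised_latent: torch.Tensor,   # (B, C, H, W) denoised latent
    emo_label: torch.Tensor,         # (B,) integer emotion labels
) -> torch.Tensor:
    """Cross-entropy on the predicted emotion of the denoised latent."""
    logits = classifier(denoised_latent)
    return F.cross_entropy(logits, emo_label)
```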