🤖 AI Summary
Existing emotional talking-face generation methods suffer from emotion misalignment and visual artifacts when reference images and driving audio convey conflicting emotions. To address this, we propose the Cross-Emotion Memory Network (CEM-Net), which explicitly models speech emotion via an Audio Emotion Enhancement (AEE) module and decouples identity and emotion representations from the reference image using an Emotion Bridging Memory (EBM) module, enabling controllable facial expression transfer from highly emotional references to target audio-driven emotions. CEM-Net further incorporates a query-based memory mechanism and cross-reconstruction training to fuse multimodal emotion features, improving expression consistency and naturalness. Extensive evaluations on multiple benchmarks demonstrate clear gains in emotion accuracy, lip-sync precision, and video quality, particularly under cross-emotion conditions, outperforming prior state-of-the-art methods.
📝 Abstract
Emotional talking face generation aims to animate a human face from given reference images and produce a talking video that matches both the content and the emotion of the driving audio. However, existing methods neglect that reference images may carry a strong emotion that conflicts with the audio emotion, leading to severe emotion inaccuracy and distorted generated results. To tackle this issue, we introduce the Cross-Emotion Memory Network (CEM-Net), designed to generate emotional talking faces aligned with the driving audio even when reference images exhibit strong emotion. Specifically, an Audio Emotion Enhancement (AEE) module is first devised with a cross-reconstruction training strategy to enhance the audio emotion, overcoming the disruption from the reference image emotion. Secondly, since reference images cannot provide sufficient facial motion information about the speaker under the audio emotion, an Emotion Bridging Memory (EBM) module is utilized to compensate for the missing information. It captures the expression displacement from the reference image emotion to the audio emotion and stores it in memory. Given a cross-emotion feature as a query, the matching displacement can be retrieved at inference time. Extensive experiments demonstrate that our CEM-Net can synthesize expressive, natural and lip-synced talking face videos with better emotion accuracy.
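The abstract does not spell out how the EBM's query-based retrieval works. Below is a minimal PyTorch sketch of one plausible reading, assuming a learnable key/value memory queried with scaled soft attention; the class name, slot count, feature dimensions, and attention formulation are all illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class EmotionBridgingMemory(nn.Module):
    """Hypothetical sketch of a query-based displacement memory.

    Assumed design: keys index memory slots, values hold stored expression
    displacements (reference-image emotion -> audio emotion). A cross-emotion
    feature queries the memory via soft attention at inference time.
    """

    def __init__(self, num_slots: int = 64, feat_dim: int = 256, disp_dim: int = 64):
        super().__init__()
        # Learnable memory: keys for addressing, values for retrieved content.
        self.keys = nn.Parameter(torch.randn(num_slots, feat_dim))
        self.values = nn.Parameter(torch.randn(num_slots, disp_dim))

    def forward(self, query: torch.Tensor) -> torch.Tensor:
        # query: (batch, feat_dim) cross-emotion feature, e.g. a fusion of the
        # reference-image emotion and the enhanced audio emotion.
        attn = F.softmax(query @ self.keys.t() / self.keys.shape[-1] ** 0.5, dim=-1)
        # Soft retrieval: attention-weighted sum of stored displacements.
        return attn @ self.values  # (batch, disp_dim)


# Usage: retrieve displacements for a batch of cross-emotion queries.
ebm = EmotionBridgingMemory()
cross_emotion_feat = torch.randn(4, 256)
displacement = ebm(cross_emotion_feat)
print(displacement.shape)  # torch.Size([4, 64])
```

Soft attention over the slots keeps retrieval differentiable, so such a memory could be trained end-to-end alongside the cross-reconstruction objective; the paper's actual addressing scheme may differ.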