MirrorTalk: Forging Personalized Avatars via Disentangled Style and Hierarchical Motion Control

📅 2026-01-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of disentangling and transferring speaker-specific facial expression styles while maintaining precise lip synchronization. To this end, the authors propose MirrorTalk, a framework that employs a Semantically-Disentangled Style Encoder (SDSE) to extract clean speaker style representations from short reference videos. By integrating a conditional diffusion model with a hierarchical motion modulation strategy, MirrorTalk dynamically coordinates the influence of audio-driven content and speaker-specific style on different facial regions during generation. This approach effectively disentangles semantic content from speaker identity in facial motion, significantly outperforming state-of-the-art methods in both lip-sync accuracy and fidelity to personalized expressive styles.

📝 Abstract
Synthesizing personalized talking faces that uphold and highlight a speaker's unique style while maintaining lip-sync accuracy remains a significant challenge. A primary limitation of existing approaches is the intrinsic confounding of speaker-specific talking style and semantic content within facial motions, which prevents the faithful transfer of a speaker's unique persona to arbitrary speech. In this paper, we propose MirrorTalk, a generative framework based on a conditional diffusion model, combined with a Semantically-Disentangled Style Encoder (SDSE) that can distill pure style representations from a brief reference video. To effectively utilize this representation, we further introduce a hierarchical modulation strategy within the diffusion process. This mechanism guides the synthesis by dynamically balancing the contributions of audio and style features across distinct facial regions, ensuring both precise lip-sync accuracy and expressive full-face dynamics. Extensive experiments demonstrate that MirrorTalk achieves significant improvements over state-of-the-art methods in terms of lip-sync accuracy and personalization preservation.
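The hierarchical modulation strategy described above, balancing audio-driven content against speaker style differently per facial region, can be illustrated with a minimal sketch. The region names, weights, and function below are illustrative assumptions for intuition only, not the paper's implementation:

```python
def region_modulated_condition(audio_feat, style_feat, region_weights):
    """Blend audio and style feature vectors per facial region.

    audio_feat, style_feat: equal-length lists of floats.
    region_weights: maps region name -> alpha in [0, 1], the audio share
        for that region; the style share is (1 - alpha).
    Returns a dict of per-region conditioning vectors.
    """
    conds = {}
    for region, alpha in region_weights.items():
        conds[region] = [alpha * a + (1.0 - alpha) * s
                         for a, s in zip(audio_feat, style_feat)]
    return conds

# Lips are dominated by audio (for lip sync); upper-face regions lean
# toward speaker style (illustrative weights, not from the paper).
weights = {"lips": 0.9, "eyes": 0.3, "brows": 0.2}
audio = [1.0, 1.0, 1.0, 1.0]
style = [0.0, 0.0, 0.0, 0.0]
conds = region_modulated_condition(audio, style, weights)
```

In the full model, each per-region conditioning vector would guide the diffusion denoiser for that region's motion, so lip motion tracks the audio closely while brow and eye dynamics preserve the speaker's style.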
Problem

Research questions and friction points this paper is trying to address.

personalized talking faces
lip-sync accuracy
style disentanglement
facial motion
speaker-specific style
Innovation

Methods, ideas, or system contributions that make the work stand out.

disentangled style
hierarchical motion control
conditional diffusion model
personalized talking avatars
lip-sync accuracy
Renjie Lu
Ping An Technology (Shenzhen) Co., Ltd., Shenzhen, China
Xulong Zhang
Ping An Technology (Shenzhen) Co., Ltd.
Federated Large Models · Trusted Computing · Graph Computing
Xiaoyang Qu
Ping An Technology (Shenzhen) Co., Ltd., Shenzhen, China
Jianzong Wang
Postdoctoral Researcher, Department of Electrical and Computer Engineering, University of Florida
Big Data · Storage System · Cloud Computing
Shangfei Wang
University of Science and Technology of China, Hefei, China