MirrorTalk: Forging Personalized Avatars via Disentangled Style and Hierarchical Motion Control

📅 2026-01-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of disentangling and transferring speaker-specific facial expression styles while maintaining precise lip synchronization. To this end, the authors propose MirrorTalk, a framework that employs a Semantically-Disentangled Style Encoder (SDSE) to extract clean speaker style representations from short reference videos. By integrating a conditional diffusion model with a hierarchical motion modulation strategy, MirrorTalk dynamically coordinates the influence of audio-driven content and speaker-specific style on different facial regions during generation. This approach effectively disentangles semantic content from speaker identity in facial motion, significantly outperforming state-of-the-art methods in both lip-sync accuracy and fidelity to personalized expressive styles.

📝 Abstract
Synthesizing personalized talking faces that uphold and highlight a speaker's unique style while maintaining lip-sync accuracy remains a significant challenge. A primary limitation of existing approaches is the intrinsic confounding of speaker-specific talking style and semantic content within facial motions, which prevents the faithful transfer of a speaker's unique persona to arbitrary speech. In this paper, we propose MirrorTalk, a generative framework based on a conditional diffusion model, combined with a Semantically-Disentangled Style Encoder (SDSE) that can distill pure style representations from a brief reference video. To effectively utilize this representation, we further introduce a hierarchical modulation strategy within the diffusion process. This mechanism guides the synthesis by dynamically balancing the contributions of audio and style features across distinct facial regions, ensuring both precise lip-sync accuracy and expressive full-face dynamics. Extensive experiments demonstrate that MirrorTalk achieves significant improvements over state-of-the-art methods in terms of lip-sync accuracy and personalization preservation.
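The hierarchical modulation strategy described above, balancing audio-driven content against speaker style differently per facial region, can be illustrated with a minimal sketch. The region names, weights, and function below are illustrative assumptions for intuition only, not the paper's implementation:

```python
def region_modulated_condition(audio_feat, style_feat, region_weights):
    """Blend audio and style feature vectors per facial region.

    audio_feat, style_feat: equal-length lists of floats.
    region_weights: maps region name -> alpha in [0, 1], the audio share
        for that region; the style share is (1 - alpha).
    Returns a dict of per-region conditioning vectors.
    """
    conds = {}
    for region, alpha in region_weights.items():
        conds[region] = [alpha * a + (1.0 - alpha) * s
                         for a, s in zip(audio_feat, style_feat)]
    return conds

# Lips are dominated by audio (for lip sync); upper-face regions lean
# toward speaker style (illustrative weights, not from the paper).
weights = {"lips": 0.9, "eyes": 0.3, "brows": 0.2}
audio = [1.0, 1.0, 1.0, 1.0]
style = [0.0, 0.0, 0.0, 0.0]
conds = region_modulated_condition(audio, style, weights)
```

In the full model, each per-region conditioning vector would guide the diffusion denoiser for that region's motion, so lip motion tracks the audio closely while brow and eye dynamics preserve the speaker's style.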
Problem

Research questions and friction points this paper is trying to address.

personalized talking faces
lip-sync accuracy
style disentanglement
facial motion
speaker-specific style
Innovation

Methods, ideas, or system contributions that make the work stand out.

disentangled style
hierarchical motion control
conditional diffusion model
personalized talking avatars
lip-sync accuracy
Renjie Lu
Ping An Technology (Shenzhen) Co., Ltd., Shenzhen, China
Xulong Zhang
Ping An Technology (Shenzhen) Co., Ltd.
Federated Large Models · Trusted Computing · Graph Computing
Xiaoyang Qu
Ping An Technology (Shenzhen) Co., Ltd., Shenzhen, China
Jianzong Wang
Postdoctoral Researcher, Department of Electrical and Computer Engineering, University of Florida
Big Data · Storage System · Cloud Computing
Shangfei Wang
University of Science and Technology of China, Hefei, China