HM-Talker: Hybrid Motion Modeling for High-Fidelity Talking Head Synthesis

📅 2025-08-14
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing audio-driven talking-head generation methods often suffer from lip jitter and motion blur because they model audio–facial motion correlations only implicitly, without explicit articulatory priors. To address this, the paper proposes HM-Talker, a framework that combines anatomically defined Action Units (AUs) as explicit motion cues with implicit motion features in a hybrid representation. A Cross-Modal Disentanglement Module (CMDM) extracts complementary implicit/explicit motion features and predicts AUs directly from audio aligned to visual cues, reducing phoneme–viseme misalignment, while a Hybrid Motion Modeling Module (HMMM) dynamically merges randomly paired implicit/explicit features to suppress identity-dependent biases in the explicit cues and enforce identity-agnostic learning. Together these components improve lip-sync accuracy, temporal coherence, and cross-subject generalization, and extensive experiments show HM-Talker outperforming state-of-the-art methods in visual quality and lip-sync accuracy while mitigating motion blur and lip jitter.
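The explicit/implicit split can be pictured as a single audio branch that regresses AU intensities (explicit, anatomy-grounded cues) alongside a residual implicit motion feature. The PyTorch sketch below is only illustrative: the module names, layer sizes, and AU count are assumptions, not the authors' implementation, which additionally aligns the predicted AUs with visual cues inside the CMDM.

```python
# Minimal sketch of a CMDM-style cross-modal disentanglement block.
# All names, dimensions, and the AU count (17) are illustrative assumptions.
import torch
import torch.nn as nn

class CrossModalDisentangler(nn.Module):
    def __init__(self, audio_dim=768, feat_dim=256, num_aus=17):
        super().__init__()
        # Projection of precomputed audio features (e.g., from a speech encoder).
        self.audio_proj = nn.Sequential(nn.Linear(audio_dim, feat_dim), nn.ReLU())
        # Explicit branch: regress Action Unit intensities from audio.
        self.au_head = nn.Linear(feat_dim, num_aus)
        # Implicit branch: residual motion features not captured by AUs.
        self.implicit_head = nn.Linear(feat_dim, feat_dim)

    def forward(self, audio_feats):
        # audio_feats: (batch, time, audio_dim)
        h = self.audio_proj(audio_feats)
        aus = torch.sigmoid(self.au_head(h))   # explicit, anatomy-grounded cues
        implicit = self.implicit_head(h)        # implicit motion residual
        return aus, implicit

# During training, the predicted AUs would be supervised against AUs extracted
# from the ground-truth frames to enforce audio-visual alignment.
```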

📝 Abstract
Audio-driven talking head video generation enhances user engagement in human-computer interaction. However, current methods frequently produce videos with motion blur and lip jitter, primarily due to their reliance on implicit modeling of audio-facial motion correlations, an approach lacking explicit articulatory priors (i.e., anatomical guidance for speech-related facial movements). To overcome this limitation, we propose HM-Talker, a novel framework for generating high-fidelity, temporally coherent talking heads. HM-Talker leverages a hybrid motion representation combining both implicit and explicit motion cues. Explicit cues use Action Units (AUs), anatomically defined facial muscle movements, alongside implicit features to minimize phoneme-viseme misalignment. Specifically, our Cross-Modal Disentanglement Module (CMDM) extracts complementary implicit/explicit motion features while predicting AUs directly from audio input aligned to visual cues. To mitigate identity-dependent biases in explicit features and enhance cross-subject generalization, we introduce the Hybrid Motion Modeling Module (HMMM). This module dynamically merges randomly paired implicit/explicit features, enforcing identity-agnostic learning. Together, these components enable robust lip synchronization across diverse identities, advancing personalized talking head synthesis. Extensive experiments demonstrate HM-Talker's superiority over state-of-the-art methods in visual quality and lip-sync accuracy.
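To make the hybrid representation identity-agnostic, the abstract describes randomly pairing implicit and explicit features before merging them. A minimal sketch of that idea, assuming batch-level re-pairing and a small MLP fusion head (both hypothetical details rather than the paper's architecture), could look like this:

```python
# Sketch of HMMM-style random re-pairing: explicit (AU) features from one clip
# are fused with implicit features from a randomly chosen other clip in the
# batch, so the fused representation cannot rely on identity-specific cues.
import torch
import torch.nn as nn

class HybridMotionFusion(nn.Module):
    def __init__(self, num_aus=17, feat_dim=256):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Linear(num_aus + feat_dim, feat_dim),
            nn.ReLU(),
            nn.Linear(feat_dim, feat_dim),
        )

    def forward(self, aus, implicit, shuffle=True):
        # aus: (batch, time, num_aus); implicit: (batch, time, feat_dim)
        if shuffle and self.training:
            # Randomly re-pair explicit and implicit features across the batch.
            perm = torch.randperm(aus.size(0), device=aus.device)
            implicit = implicit[perm]
        return self.fuse(torch.cat([aus, implicit], dim=-1))

# fused = HybridMotionFusion()(aus, implicit)  # motion code driving the renderer
```

Shuffling only at training time forces the fusion network to treat AU cues and implicit features as interchangeable across subjects rather than memorizing per-identity correlations, which is the stated goal of the identity-agnostic merging.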
Problem

Research questions and friction points this paper is trying to address.

Generating high-fidelity talking head videos without motion blur
Overcoming lip jitter from implicit audio-facial motion modeling
Enhancing cross-subject generalization in personalized synthesis
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hybrid motion modeling with implicit and explicit cues
Cross-modal disentanglement module for audio-visual alignment
Dynamic feature merging for identity-agnostic learning