MoCoTalk: Multi-Conditional Diffusion with Adaptive Router for Controllable Talking Head Generation

📅 2026-05-08

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing talking-head generation methods struggle to jointly model identity, head pose, facial expression, and lip motion, and fixed fusion weights often cause interference between conditions. This work proposes MoCoTalk, a video diffusion framework that unifies four control signals (reference image, facial keypoints, 3DMM-rendered shading mesh, and audio) through an Adaptive Multi-Condition Router that performs channel-wise, timestep-aware dynamic fusion. To decouple head motion, mouth motion, expression, and lighting, it introduces a Mouth-Augmented Shading Mesh built on 3DMM and adds a lip consistency loss to tighten audio-visual alignment. Experiments show that MoCoTalk achieves state-of-the-art results on the majority of structural, motion, and perceptual metrics, while enabling attribute-level control and flexible recombination of facial attributes at inference.

📝 Abstract

Talking-head generation requires joint modeling of identity, head pose, facial expression, and mouth dynamics. Existing methods typically address only a subset of these factors, and rely on fixed-weight or heuristic fusion when multiple conditions are involved. We present MoCoTalk, a multi-conditional video diffusion framework that unifies four complementary control signals: a reference image, facial keypoints, 3DMM-rendered shading meshes, and the corresponding speech audio. To resolve destructive interference among heterogeneous conditions, we introduce an Adaptive Multi-Condition Router that computes channel-wise, timestep-aware gating over the four condition streams, allowing the fusion strategy to vary with both feature subspace and noise level. To better capture speech-related facial dynamics, we design a Mouth-Augmented Shading Mesh, a 3DMM-based representation that decouples head motion, mouth motion, expression, and lighting. This design provides a temporally consistent geometric prior and allows flexible recombination of these attributes at inference. We further introduce a lip consistency loss to tighten audio-visual alignment. Extensive experiments show that MoCoTalk achieves state-of-the-art performance on the majority of structural, motion, and perceptual metrics, while offering attribute-level controllability that single-condition methods do not provide.
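The channel-wise, timestep-aware gating described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the softmax normalization across streams, the sinusoidal timestep embedding, and every name here (`STREAMS`, `timestep_embedding`, `route`, `weights`, `biases`) are assumptions chosen for illustration. For brevity the gates depend only on the timestep and the channel index, whereas a real router would likely also condition on the features themselves.

```python
import math

# Hypothetical names for the four condition streams listed in the abstract.
STREAMS = ("reference", "keypoints", "mesh", "audio")


def timestep_embedding(t, dim):
    """Sinusoidal embedding of a scalar diffusion timestep (dim must be even)."""
    half = dim // 2
    freqs = [math.exp(-math.log(10000.0) * i / half) for i in range(half)]
    return [math.sin(t * f) for f in freqs] + [math.cos(t * f) for f in freqs]


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def route(features, t, weights, biases, emb_dim=4):
    """Fuse per-stream feature vectors with channel-wise, timestep-aware gates.

    features: dict stream -> list of C floats
    weights:  dict stream -> C lists of emb_dim floats (assumed learned parameters)
    biases:   dict stream -> list of C floats (assumed learned parameters)
    Returns (fused, gates); for each channel the gates sum to 1 across streams.
    """
    emb = timestep_embedding(t, emb_dim)
    channels = len(features[STREAMS[0]])
    gates = {s: [0.0] * channels for s in STREAMS}
    fused = [0.0] * channels
    for c in range(channels):
        # One logit per stream for this channel, driven by the timestep embedding.
        logits = [
            sum(w * e for w, e in zip(weights[s][c], emb)) + biases[s][c]
            for s in STREAMS
        ]
        for s, g in zip(STREAMS, softmax(logits)):
            gates[s][c] = g
            fused[c] += g * features[s][c]
    return fused, gates
```

With all gate parameters at zero, every stream receives weight 0.25 in every channel and the output reduces to a plain average, which corresponds to the fixed-weight fusion the abstract argues against. Non-zero parameters let each channel favor a different stream, and let that preference shift with the noise level.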

Problem

Research questions and friction points this paper is trying to address.

talking-head generation

multi-condition fusion

facial dynamics

audio-visual alignment

controllable synthesis

Innovation

Methods, ideas, or system contributions that make the work stand out.

multi-conditional diffusion

adaptive router

talking-head generation