Unlock Pose Diversity: Accurate and Efficient Implicit Keypoint-based Spatiotemporal Diffusion for Audio-driven Talking Portrait

📅 2025-03-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses three key challenges in audio-driven talking-head generation from a single image: identity distortion, loss of facial details, and insufficient head pose diversity. We propose KDTalker, the first framework to integrate unsupervised implicit 3D keypoint extraction with a lightweight spatiotemporal diffusion model. Departing from conventional 3D Morphable Model (3DMM) approaches constrained by a fixed facial topology, our method adaptively models facial information density and employs a customized spatiotemporal attention mechanism to jointly enhance detail fidelity and pose diversity while preserving subject identity. Quantitative and qualitative evaluations demonstrate state-of-the-art performance in lip-sync accuracy, pose diversity, and inference efficiency, significantly outperforming existing keypoint-driven and end-to-end image generation methods.

📝 Abstract
Audio-driven single-image talking portrait generation plays a crucial role in virtual reality, digital human creation, and filmmaking. Existing approaches are generally categorized into keypoint-based and image-based methods. Keypoint-based methods effectively preserve character identity but struggle to capture fine facial details due to the fixed-point limitation of the 3D Morphable Model. Moreover, traditional generative networks face challenges in establishing causality between audio and keypoints on limited datasets, resulting in low pose diversity. In contrast, image-based approaches produce high-quality portraits with diverse details using diffusion networks but incur identity distortion and high computational cost. In this work, we propose KDTalker, the first framework to combine unsupervised implicit 3D keypoints with a spatiotemporal diffusion model. Leveraging unsupervised implicit 3D keypoints, KDTalker adapts to varying facial information densities, allowing the diffusion process to model diverse head poses and capture fine facial details flexibly. The custom-designed spatiotemporal attention mechanism ensures accurate lip synchronization, producing temporally consistent, high-quality animations while enhancing computational efficiency. Experimental results demonstrate that KDTalker achieves state-of-the-art performance in lip synchronization accuracy, head pose diversity, and execution efficiency. Our code is available at https://github.com/chaolongy/KDTalker.
Problem

Research questions and friction points this paper is trying to address.

Enhance pose diversity in audio-driven talking portraits
Improve facial detail capture and identity preservation
Optimize computational efficiency and lip synchronization accuracy
Innovation

Methods, ideas, or system contributions that make the work stand out.

Combines unsupervised implicit 3D keypoints with a spatiotemporal diffusion model
Custom spatiotemporal attention mechanism for accurate lip synchronization
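The paper does not publish its layer definitions on this page, but the core idea of spatiotemporal attention over a keypoint sequence can be sketched in plain NumPy. The factorization into a spatial pass (across keypoints within a frame) and a temporal pass (across frames per keypoint), along with all shapes and function names below, are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # scaled dot-product self-attention, batched over leading axes
    d = q.shape[-1]
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(d)
    return softmax(scores) @ v

def spatiotemporal_attention(x):
    """x: (T, K, D) — T frames, K implicit keypoints, D channels.

    Spatial pass: each frame's K keypoints attend to one another.
    Temporal pass: each keypoint's trajectory attends across the T frames,
    which is where temporal consistency (e.g. for lip sync) would come from.
    """
    x = x + attention(x, x, x)           # spatial attention, batched over T
    xt = np.swapaxes(x, 0, 1)            # (K, T, D)
    xt = xt + attention(xt, xt, xt)      # temporal attention, batched over K
    return np.swapaxes(xt, 0, 1)         # back to (T, K, D)

rng = np.random.default_rng(0)
seq = rng.normal(size=(8, 5, 16))        # 8 frames, 5 keypoints, 16-dim features
out = spatiotemporal_attention(seq)
print(out.shape)  # (8, 5, 16)
```

In a diffusion model such blocks would be stacked and conditioned on audio features and the diffusion timestep; this sketch only shows the attention factorization itself.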
Chaolong Yang
University of Liverpool, PhD Candidate
Talking Portrait Synthesis
Kai Yao
Ant Group, Hangzhou, 310000, China
Yuyao Yan
Xi'an Jiaotong-Liverpool University
Chenru Jiang
Digital Innovation Research Center, Duke Kunshan University, Kunshan, 215316, China
Weiguang Zhao
University of Liverpool, PhD Candidate
3D Vision · Embodied AI · Open World
Jie Sun
Department of Mechatronics and Robotics, Xi'an Jiaotong-Liverpool University, Suzhou, 215123, China
Guangliang Cheng
Reader (Associate Professor), University of Liverpool
Computer Vision · Deepfake Detection · Autonomous Driving · Robotics
Yifei Zhang
Ricoh Software Research Center, Beijing, 100027, China
Bin Dong
Digital Innovation Research Center, Duke Kunshan University, Kunshan, 215316, China
Kaizhu Huang
Professor, Duke Kunshan University
Generalization & Robustness · Statistical Learning Theory · Trustworthy AI