LPIPS-AttnWav2Lip: Generic audio-driven lip synchronization for talking head generation in the wild

📅 2023-12-01
🏛️ Speech Communication
📈 Citations: 7 (influential: 0)
🤖 AI Summary
This work proposes a speaker-agnostic approach to audio-driven lip-sync generation that fuses audio and visual features through a residual CBAM attention module embedded in a U-Net architecture. To strengthen cross-modal alignment, a semantic alignment module enlarges the receptive field, enabling more precise matching between audio and visual representations. The method additionally adopts the LPIPS perceptual loss to improve the visual realism of the generated faces and the consistency between audio and video. Experiments show that the method reaches state-of-the-art performance on both subjective evaluations and objective metrics, generalizes well to unseen speakers, and delivers high lip-sync accuracy and image quality.
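
To make the attention mechanism concrete, below is a minimal PyTorch sketch of a residual CBAM block (channel attention followed by spatial attention, with a skip connection) of the kind the summary describes embedding in the U-Net. The reduction ratio, kernel size, and exact module layout are illustrative assumptions, not the paper's actual configuration.

```python
# Hedged sketch of a residual CBAM attention block (after Woo et al., 2018).
# Hyperparameters (reduction ratio, kernel size) are illustrative assumptions.
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        # Shared MLP applied to both average- and max-pooled descriptors
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        avg = self.mlp(x.mean(dim=(2, 3)))   # global average pooling
        mx = self.mlp(x.amax(dim=(2, 3)))    # global max pooling
        return torch.sigmoid(avg + mx).view(b, c, 1, 1)

class SpatialAttention(nn.Module):
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        avg = x.mean(dim=1, keepdim=True)    # channel-wise average map
        mx, _ = x.max(dim=1, keepdim=True)   # channel-wise max map
        return torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))

class ResidualCBAM(nn.Module):
    """Channel then spatial attention, with a skip connection around both."""
    def __init__(self, channels):
        super().__init__()
        self.ca = ChannelAttention(channels)
        self.sa = SpatialAttention()

    def forward(self, x):
        out = x * self.ca(x)                 # reweight channels
        out = out * self.sa(out)             # reweight spatial positions
        return x + out                       # residual connection
```

In a U-Net, such a block would typically sit on the fused audio-visual feature maps, letting the network emphasize mouth-region features driven by the audio embedding.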

Problem

Research questions and friction points this paper is trying to address.

lip synchronization
talking head generation
audio-driven
audio-visual coherence
Innovation

Methods, ideas, or system contributions that make the work stand out.

Lip Synchronization
Audio-Driven Talking Head
CBAM Attention
Semantic Alignment
LPIPS Loss (sketched below)
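
For the LPIPS Loss item above, here is a hedged sketch of how an LPIPS term is typically combined with a pixel reconstruction loss using the `lpips` package (Zhang et al., 2018). The backbone choice, the loss weighting, and the `generator_loss` helper are illustrative assumptions; the paper's exact loss formulation may differ.

```python
import torch
import lpips

# LPIPS network acts as a fixed perceptual feature extractor.
# 'vgg' backbone assumed here; 'alex' is the common alternative.
perceptual = lpips.LPIPS(net='vgg')

def generator_loss(fake, real, lpips_weight=1.0):
    """L1 + weighted LPIPS between generated and ground-truth frames.

    `fake` and `real` are image batches scaled to [-1, 1], as LPIPS expects.
    The weight is a placeholder, not the paper's reported value.
    """
    l1 = torch.nn.functional.l1_loss(fake, real)
    perc = perceptual(fake, real).mean()
    return l1 + lpips_weight * perc
```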
Authors
Zhipeng Chen
University of Science and Technology Beijing, Beijing, 100083, China
Xinheng Wang
Xi'an Jiaotong-Liverpool University
Lun Xie
University of Science and Technology Beijing, Beijing, 100083, China
Haijie Yuan
Xiaoduo Intelligent Technology (Beijing) Co., Ltd, Beijing, 100094, China
Hang Pan
Department of Computer Science, Changzhi University, Changzhi, 046011, China