🤖 AI Summary
Existing lip-sync methods struggle with real-world challenges in high-resolution videos—including facial expression leakage, occlusions, and temporal inconsistency—hindering applications like automated dubbing. To address these issues, we propose a two-stage generative framework: (1) modeling temporal lip-motion priors, and (2) fusing audio with adaptively masked facial regions—specifically targeting occluded areas. We introduce a leakage-aware loss and a cross-synchronization training paradigm, and propose LipLeak—a novel metric to quantify expression leakage. Notably, this is the first work to systematically decouple and jointly optimize the intertwined leakage and occlusion problems. Experiments demonstrate state-of-the-art performance in both lip-shape reconstruction and cross-audio synchronization. Our method achieves significant visual quality improvements, reduces LipLeak by 37%, and boosts synchronization accuracy by 22% under occlusion scenarios.
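The two-stage pipeline described above — first modeling a temporal lip-motion prior from audio, then fusing it with adaptively masked facial regions — can be sketched as follows. This is a minimal illustrative sketch, not the paper's actual architecture: `adaptive_mask` uses a simple bounding box as a stand-in for the learned masking strategy, and `prior_fn`/`fusion_fn` are hypothetical placeholders for the two stages.

```python
import numpy as np

def adaptive_mask(frame, mouth_box):
    """Zero out the mouth region so the generator cannot copy lip
    shapes from the input video (i.e., expression leakage).
    `mouth_box` = (y0, y1, x0, x1) is a hypothetical box-based
    stand-in for the paper's adaptive masking strategy."""
    masked = frame.copy()
    y0, y1, x0, x1 = mouth_box
    masked[y0:y1, x0:x1] = 0.0
    return masked

def two_stage_lipsync(frames, audio_feats, mouth_box, prior_fn, fusion_fn):
    """Stage 1: derive a temporal lip-motion prior from audio features.
    Stage 2: fuse that prior with the adaptively masked frames.
    `prior_fn` and `fusion_fn` are placeholders for learned models."""
    motion_prior = prior_fn(audio_feats)              # shape (T, d)
    masked = [adaptive_mask(f, mouth_box) for f in frames]
    return [fusion_fn(m, p) for m, p in zip(masked, motion_prior)]
```

Because the mouth region is blanked before fusion, the only lip-shape signal available to the second stage comes from the audio-driven prior — the mechanism the summary credits with reducing leakage.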
📝 Abstract
Lip synchronization, the task of aligning lip movements in an existing video with new input audio, is typically framed as a simpler variant of audio-driven facial animation. However, beyond suffering from the usual issues in talking-head generation (e.g., temporal consistency), lip synchronization presents significant new challenges such as expression leakage from the input video and facial occlusions, which can severely impact real-world applications like automated dubbing but are often neglected in existing works. To address these shortcomings, we present KeySync, a two-stage framework that resolves the issue of temporal consistency while also mitigating leakage and occlusions through a carefully designed masking strategy. We show that KeySync achieves state-of-the-art results in lip reconstruction and cross-synchronization, improving visual quality and reducing expression leakage according to LipLeak, our novel leakage metric. Furthermore, we demonstrate the effectiveness of our new masking approach in handling occlusions and validate our architectural choices through several ablation studies. Code and model weights can be found at https://antonibigata.github.io/KeySync.