🤖 AI Summary
This paper addresses the “lip identity leakage” problem in audio-driven talking face generation—where lip movements are inadvertently influenced by the visual identity of the reference image rather than being driven solely by the audio. To this end, we propose the first systematic evaluation framework. Methodologically, we design three critical test scenarios—silent input, audio-video mismatch, and matched synthesis—and introduce novel, model-agnostic metrics, including lip-sync discrepancy and a silent lip-sync score. Our core contributions are: (i) the first formal quantification and detection of lip identity leakage; (ii) empirical evidence of the hidden impact of reference image selection on generation consistency; and (iii) a reproducible benchmarking protocol. Extensive experiments demonstrate that our framework reliably identifies leakage in state-of-the-art models, providing a standardized tool and practical guidelines for fair evaluation and targeted model improvement.
📝 Abstract
Inpainting-based talking face generation aims to preserve video details such as pose, lighting, and gestures while modifying only lip motion, often using an identity reference image to maintain speaker consistency. However, this mechanism can introduce lip leakage, where generated lips are influenced by the reference image rather than solely by the driving audio. Such leakage is difficult to detect with standard metrics and conventional test setups. To address this, we propose a systematic evaluation methodology to analyze and quantify lip leakage. Our framework employs three complementary test setups: silent-input generation, mismatched audio-video pairing, and matched audio-video synthesis. We also introduce derived metrics, including lip-sync discrepancy and silent-audio-based lip-sync scores. In addition, we study how different identity reference selections affect leakage, providing insights into reference design. The proposed methodology is model-agnostic and establishes a more reliable benchmark for future research in talking face generation.