Assessing Identity Leakage in Talking Face Generation: Metrics and Evaluation Framework

📅 2025-11-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper addresses the "lip identity leakage" problem in audio-driven talking face generation, where lip movements are inadvertently influenced by the visual identity of the reference image rather than being driven solely by the audio. To this end, the authors propose the first systematic evaluation framework for this phenomenon. Methodologically, they design three critical test scenarios — silent input, audio-video mismatch, and matched synthesis — and introduce novel, model-agnostic metrics, including lip-sync discrepancy and a silent lip-sync score. The core contributions are: (i) the first formal quantification and detection of lip identity leakage; (ii) an empirical demonstration of the hidden impact of reference image selection on generation consistency; and (iii) a reproducible benchmarking protocol. Extensive experiments show that the framework reliably identifies leakage in state-of-the-art models, providing a standardized tool and practical guidelines for fair evaluation and targeted model improvement.

📝 Abstract
Inpainting-based talking face generation aims to preserve video details such as pose, lighting, and gestures while modifying only lip motion, often using an identity reference image to maintain speaker consistency. However, this mechanism can introduce lip leakage, where the generated lips are influenced by the reference image rather than solely by the driving audio. Such leakage is difficult to detect with standard metrics and conventional test setups. To address this, we propose a systematic evaluation methodology to analyze and quantify lip leakage. Our framework employs three complementary test setups: silent-input generation, mismatched audio-video pairing, and matched audio-video synthesis. We also introduce derived metrics, including lip-sync discrepancy and silent-audio-based lip-sync scores. In addition, we study how different identity reference selections affect leakage, providing insights into reference design. The proposed methodology is model-agnostic and establishes a more reliable benchmark for future research in talking face generation.
Problem

Research questions and friction points this paper is trying to address.

Detecting identity leakage in talking face generation models
Quantifying lip leakage through systematic evaluation metrics
Establishing model-agnostic benchmarks for face generation systems
Innovation

Methods, ideas, or system contributions that make the work stand out.

Silent-input generation to detect lip leakage
Mismatched audio-video pairing for leakage analysis
Matched audio-video synthesis with derived metrics
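The three setups above yield per-clip lip-sync confidence scores (e.g. from a SyncNet-style scorer) from which the derived metrics can be computed. A minimal sketch, assuming such scores are already available — all function and variable names here are illustrative, not from the paper:

```python
# Hedged sketch of the two derived metrics described in the abstract.
# Assumption: `matched_scores` and `mismatched_scores` are lip-sync
# confidence values from a SyncNet-style scorer, one per test clip.

def lip_sync_discrepancy(matched_scores, mismatched_scores):
    """Gap between mean sync confidence on matched vs. mismatched audio.

    A leakage-free model should sync well only with the matched audio,
    so a small discrepancy suggests the lips follow the reference
    identity rather than the driving audio.
    """
    mean = lambda xs: sum(xs) / len(xs)
    return mean(matched_scores) - mean(mismatched_scores)


def silent_lip_sync_score(silent_scores):
    """Mean sync confidence when the driving audio is silence.

    With silent input the lips should stay still; residual lip motion
    (nonzero confidence) indicates leakage from the reference image.
    """
    return sum(silent_scores) / len(silent_scores)


# Toy usage with made-up scores:
matched = [7.2, 6.8, 7.5]     # scored against the matched audio
mismatched = [6.9, 7.1, 6.7]  # same clips, shuffled audio
print(round(lip_sync_discrepancy(matched, mismatched), 2))  # → 0.27 (small gap: suspicious)
```

The design choice is that both metrics are model-agnostic: they only require the generated videos and an off-the-shelf sync scorer, so any inpainting-based method can be benchmarked without access to its internals.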
Dogucan Yaman
Karlsruhe Institute of Technology
Fevziye Irem Eyiokur
Karlsruhe Institute of Technology
H. K. Ekenel
Istanbul Technical University
Alexander Waibel
Carnegie Mellon University, Karlsruhe Institute of Technology
Machine Learning · Neural Networks · Speech Translation · Multimodal Interfaces