🤖 AI Summary
This paper addresses the “lip identity leakage” problem in audio-driven talking face generation—where lip movements are inadvertently influenced by the visual identity of the reference image rather than being driven solely by the audio. To this end, we propose the first systematic evaluation framework. Methodologically, we design three critical test scenarios—silent input, audio-video mismatch, and matched synthesis—and introduce novel, model-agnostic metrics, including lip-sync discrepancy and a silent lip-sync score. Our core contributions are: (i) the first formal quantification and detection of lip identity leakage; (ii) empirical evidence of the hidden impact of reference image selection on generation consistency; and (iii) a reproducible benchmarking protocol. Extensive experiments demonstrate that our framework reliably identifies leakage in state-of-the-art models, providing a standardized tool and practical guidelines for fair evaluation and targeted model improvement.
📝 Abstract
Inpainting-based talking face generation aims to preserve video details such as pose, lighting, and gestures while modifying only lip motion, often using an identity reference image to maintain speaker consistency. However, this mechanism can introduce lip leakage, where generated lips are influenced by the reference image rather than solely by the driving audio. Such leakage is difficult to detect with standard metrics and conventional test setups. To address this, we propose a systematic evaluation methodology to analyze and quantify lip leakage. Our framework employs three complementary test setups: silent-input generation, mismatched audio-video pairing, and matched audio-video synthesis. We also introduce derived metrics, including lip-sync discrepancy and silent-audio-based lip-sync scores. In addition, we study how different identity reference selections affect leakage, providing insights into reference design. The proposed methodology is model-agnostic and establishes a more reliable benchmark for future research in talking face generation.