🤖 AI Summary
To address the security vulnerability in AI-driven talking-head videoconferencing systems, where the transmitted latent can be puppeteered to spoof a victim's identity in real time, this paper proposes a real-time identity consistency check that works in the latent space and never inspects the reconstructed RGB video. The approach introduces a pose-conditioned, large-margin contrastive encoder that separates persistent identity cues from transient pose and expression directly in the transmitted latent. A lightweight online cosine-similarity test on the resulting embedding then flags identity swaps. Evaluated on multiple talking-head generation models, the method consistently outperforms existing puppeteering defenses, runs in real time, and generalizes well to out-of-distribution scenarios. The authors present it as the first biometric leakage defense that operates without looking at the reconstructed RGB video.
📝 Abstract
AI-based talking-head videoconferencing systems reduce bandwidth by sending a compact pose-expression latent and re-synthesizing RGB at the receiver, but this latent can be puppeteered, letting an attacker hijack a victim's likeness in real time. Because every frame is synthetic, deepfake and synthetic video detectors fail outright. To address this security problem, we exploit a key observation: the pose-expression latent inherently contains biometric information of the driving identity. We therefore introduce the first biometric leakage defense that never looks at the reconstructed RGB video: a pose-conditioned, large-margin contrastive encoder that isolates persistent identity cues inside the transmitted latent while cancelling transient pose and expression. A simple cosine test on this disentangled embedding flags illicit identity swaps as the video is rendered. Our experiments on multiple talking-head generation models show that our method consistently outperforms existing puppeteering defenses, operates in real time, and generalizes strongly to out-of-distribution scenarios.
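The cosine test mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the contrastive encoder (not shown) has already produced identity embeddings as plain float vectors, that a reference embedding for the expected identity is available, and that a fixed decision threshold is used. The names `cosine_similarity` and `is_puppeteered`, the threshold value, and the toy vectors are all hypothetical.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def is_puppeteered(
    reference_embedding: Sequence[float],
    frame_embedding: Sequence[float],
    threshold: float = 0.5,  # hypothetical value; would be tuned on validation data
) -> bool:
    """Flag a frame whose identity embedding drifts away from the reference."""
    return cosine_similarity(reference_embedding, frame_embedding) < threshold


# Toy 3-dimensional embeddings, chosen only for illustration.
reference = [0.9, 0.1, 0.0]
same_speaker = [0.8, 0.2, 0.1]
other_speaker = [0.0, 0.2, 0.9]

print(is_puppeteered(reference, same_speaker))   # similar direction, not flagged
print(is_puppeteered(reference, other_speaker))  # nearly orthogonal, flagged
```

Because the check is a single dot product and two norms per frame, it adds little cost on top of the encoder's forward pass, which is consistent with the real-time claim above.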