🤖 AI Summary
This work addresses the limited generalization of current deepfake speech detection methods, which often stems from models conflating speaker identity cues with synthesis artifacts. The paper is the first to formally identify and articulate this failure mode as the “speaker entanglement” problem. It proposes SNAP, a framework that constructs a speaker subspace from the representations of a self-supervised speech encoder and applies orthogonal projection to remove speaker-related components, thereby disentangling identity information from forgery traces. Extensive experiments demonstrate that SNAP achieves state-of-the-art performance across multiple benchmarks, with particularly notable gains in cross-speaker scenarios, substantially improving generalization in deepfake detection.
📝 Abstract
Recent advancements in text-to-speech technologies enable the generation of high-fidelity synthetic speech that is nearly indistinguishable from real human voices. While recent studies show the efficacy of self-supervised learning-based speech encoders for deepfake detection, these models struggle to generalize across unseen speakers. Our quantitative analysis suggests that these encoder representations are substantially influenced by speaker information, causing detectors to exploit speaker-specific correlations rather than artifact-related cues. We call this phenomenon speaker entanglement. To mitigate this reliance, we introduce SNAP, a speaker-nulling framework. We estimate a speaker subspace and apply orthogonal projection to suppress speaker-dependent components, isolating synthesis artifacts within the residual features. By reducing speaker entanglement, SNAP encourages detectors to focus on artifact-related patterns, leading to state-of-the-art performance.
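To make the speaker-nulling step concrete, here is a minimal NumPy sketch of the general idea: estimate a low-rank speaker subspace from per-speaker mean encoder features, then project frame-level features onto its orthogonal complement. The function names, the SVD-based subspace estimate, and the rank parameter `k` are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def estimate_speaker_subspace(speaker_means: np.ndarray, k: int) -> np.ndarray:
    """Estimate a rank-k speaker subspace (hypothetical sketch).

    speaker_means: (num_speakers, D) mean encoder features per speaker.
    Returns U: (D, k) orthonormal basis spanning speaker-variation directions.
    """
    centered = speaker_means - speaker_means.mean(axis=0, keepdims=True)
    # Top-k right singular vectors capture the dominant speaker directions.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[:k].T  # (D, k)

def null_speaker_components(features: np.ndarray, basis: np.ndarray) -> np.ndarray:
    """Orthogonally project features onto the complement of the speaker subspace.

    features: (T, D) frame-level SSL encoder outputs.
    basis:    (D, k) orthonormal speaker-subspace basis.
    """
    # x_residual = x - U (U^T x): removes speaker-aligned components,
    # leaving residual features in which synthesis artifacts are isolated.
    return features - (features @ basis) @ basis.T
```

Feeding these residual features to a downstream detector, rather than the raw encoder outputs, is what would force it to rely on artifact-related patterns instead of speaker-specific correlations, which is the behavior SNAP is designed to encourage.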