Profiling the Voice: Speaker-Specific Phoneme Fingerprinting for Speech Deepfake Detection

📅 2026-05-17

🤖 AI Summary

This study addresses a limitation of existing speech deepfake detectors: they do not model speaker-specific pronunciation patterns, so they offer weak protection for high-profile individuals. The authors propose a Phoneme-based Voice Profiling (PVP) framework that moves from utterance-level to fine-grained phoneme-level modeling. PVP fits lightweight Gaussian Mixture Models (GMMs) to the speaker-specific acoustic distribution of each phoneme, building an interpretable, personalized voiceprint from a small amount of genuine speech, with no synthetic or spoofed samples needed for training. The work also introduces what the authors describe as the first Chinese deepfake dataset featuring public figures. On this dataset, the authors report that PVP significantly outperforms state-of-the-art general-purpose detectors, substantially reducing the equal error rate (EER), while also supporting phoneme-level forensic analysis.
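The per-phoneme profiling idea can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes frame-level feature vectors have already been extracted and aligned to phoneme labels (the summary does not say which features or aligner are used), it uses scikit-learn's `GaussianMixture`, and the function names `fit_phoneme_profile` and `score_utterance`, the component count, and the toy data are all hypothetical.

```python
import numpy as np
from sklearn.mixture import GaussianMixture


def fit_phoneme_profile(frames_by_phoneme, n_components=2, seed=0):
    """Fit one diagonal-covariance GMM per phoneme, using bona fide frames only.

    frames_by_phoneme maps a phoneme label to an array of shape
    (n_frames, n_features). The feature type is an assumption of this sketch.
    """
    profile = {}
    for phoneme, frames in frames_by_phoneme.items():
        gmm = GaussianMixture(
            n_components=n_components,
            covariance_type="diag",
            random_state=seed,
        )
        profile[phoneme] = gmm.fit(np.asarray(frames))
    return profile


def score_utterance(profile, segments):
    """Score a test utterance against a speaker profile.

    segments is a list of (phoneme, frames) pairs. Each phoneme score is the
    mean per-frame log-likelihood under that phoneme's GMM; the utterance
    score is the plain average of the phoneme scores (a simplification).
    """
    per_phoneme = [
        (phoneme, float(profile[phoneme].score(np.asarray(frames))))
        for phoneme, frames in segments
        if phoneme in profile  # phonemes missing from the profile are skipped
    ]
    if not per_phoneme:
        raise ValueError("no segment matches a phoneme in the profile")
    utterance_score = sum(score for _, score in per_phoneme) / len(per_phoneme)
    return utterance_score, per_phoneme


# Toy data standing in for real features: two phonemes, 4-dimensional frames.
rng = np.random.default_rng(0)
reference = {
    "a": rng.normal(0.0, 1.0, (200, 4)),
    "i": rng.normal(3.0, 1.0, (200, 4)),
}
profile = fit_phoneme_profile(reference)

# "Genuine" test frames follow the reference distributions; "shifted" frames
# mimic a voice whose phoneme realizations drift away from the speaker's.
genuine = [("a", rng.normal(0.0, 1.0, (30, 4))), ("i", rng.normal(3.0, 1.0, (30, 4)))]
shifted = [("a", rng.normal(2.0, 1.0, (30, 4))), ("i", rng.normal(5.0, 1.0, (30, 4)))]

genuine_score, _ = score_utterance(profile, genuine)
shifted_score, per_phoneme = score_utterance(profile, shifted)
print(genuine_score > shifted_score)
```

In this toy setup the shifted utterance receives a lower score than the genuine one, and the `per_phoneme` list shows which phonemes contribute most to the drop, which is the kind of phoneme-level evidence the summary refers to as forensic analysis.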

📝 Abstract

The rapid advancement of generative AI has made audio deepfakes increasingly indistinguishable from authentic human vocals, posing significant threats to persons-of-interest (POI) such as public figures. Current detection systems primarily rely on generic, black-box models that fail to capture speaker-specific idiosyncratic traits and lack interpretability. In this paper, we propose Phoneme-based Voice Profiling (PVP), a novel personalized defense framework. By shifting the detection paradigm from macro-utterance analysis to micro-phonetic modeling, PVP captures the unique acoustic distributions underlying a POI's habitual articulatory patterns. Specifically, our framework models speaker-specific phonetic realizations using lightweight Gaussian Mixture Models (GMMs) estimated solely from bona fide reference speech. This design enables data-efficient profiling and robust generalization to previously unseen spoofing attacks without requiring heavy spoof-specific training. Furthermore, we introduce the first large-scale Chinese POI deepfake dataset to benchmark speaker-specific detection. Experimental results demonstrate that PVP significantly outperforms state-of-the-art generic detectors in POI spoofing scenarios, achieving substantial EER reductions while providing fine-grained, phoneme-level interpretability for forensic analysis. Code and data are available at: https://github.com/JunXue-tech/PVP
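Since the reported gains are stated as EER reductions, a short sketch of how an EER can be estimated from detector scores may help. This is a generic threshold sweep, not the paper's evaluation code: the function name `equal_error_rate` is hypothetical, higher scores are assumed to mean "more likely bona fide", and both score lists are assumed to be non-empty.

```python
def equal_error_rate(bona_fide_scores, spoof_scores):
    """Approximate the EER by sweeping thresholds over the observed scores.

    A trial is accepted as bona fide when its score is >= the threshold.
    The EER is taken at the threshold where the false acceptance rate (FAR)
    and false rejection rate (FRR) are closest, as the mean of the two.
    """
    thresholds = sorted(set(bona_fide_scores) | set(spoof_scores))
    best_gap, eer = float("inf"), None
    for t in thresholds:
        # FRR: fraction of bona fide trials rejected (score below threshold).
        frr = sum(s < t for s in bona_fide_scores) / len(bona_fide_scores)
        # FAR: fraction of spoofed trials accepted (score at or above threshold).
        far = sum(s >= t for s in spoof_scores) / len(spoof_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer


# Illustrative scores only, not results from the paper.
bona_fide = [1.0, 2.0, 3.0, 4.0]
spoofed = [0.0, 0.5, 1.5, 2.5]
print(equal_error_rate(bona_fide, spoofed))
```

With these illustrative scores, the threshold 2.0 rejects one of four bona fide trials and accepts one of four spoofed trials, so the estimate is 0.25. Production evaluations typically interpolate between thresholds rather than picking the nearest observed one.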

Problem

Research questions and friction points this paper is trying to address.

deepfake detection

speaker-specific

phoneme

voice profiling

spoofing

Innovation

Methods, ideas, or system contributions that make the work stand out.

phoneme-level modeling

speaker-specific profiling

Gaussian Mixture Models