Profiling the Voice: Speaker-Specific Phoneme Fingerprinting for Speech Deepfake Detection

📅 2026-05-17

🤖 AI Summary
This study addresses a limitation of existing speech deepfake detectors: they do not model speaker-specific pronunciation patterns and therefore protect high-profile individuals poorly. The authors propose a phoneme-based voice profiling (PVP) framework that shifts modeling from the utterance level to the phoneme level. Using lightweight Gaussian Mixture Models (GMMs), PVP captures the speaker-specific acoustic distribution of each phoneme, so an interpretable, personalized voice profile can be built from a small amount of genuine speech, with no synthetic or spoofed samples needed for training. The work also introduces what the authors describe as the first large-scale Chinese deepfake dataset of public figures. On this dataset, PVP is reported to significantly outperform state-of-the-art general-purpose detectors, substantially reducing the equal error rate (EER) while supporting phoneme-level forensic analysis.
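Since the summary reports results in terms of EER, a minimal sketch of how that metric is typically computed may help. It assumes that a higher score means "more likely bona fide"; the function name `equal_error_rate` and the toy scores are illustrative and are not taken from the paper or its code.

```python
def equal_error_rate(bona_scores, spoof_scores):
    """Return (eer, threshold) for scores where higher means more bona fide.

    Sweeps every observed score as a candidate threshold and picks the one
    where the false-rejection and false-acceptance rates are closest.
    """
    best = None
    for t in sorted(set(bona_scores) | set(spoof_scores)):
        # Bona fide utterances scored below the threshold are wrongly rejected.
        frr = sum(s < t for s in bona_scores) / len(bona_scores)
        # Spoofed utterances scored at or above the threshold are wrongly accepted.
        far = sum(s >= t for s in spoof_scores) / len(spoof_scores)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2, t)
    return best[1], best[2]


# Toy scores (made up for illustration only).
bona = [0.9, 0.8, 0.7, 0.6]
spoof = [0.1, 0.2, 0.3, 0.65]
eer, threshold = equal_error_rate(bona, spoof)
```

A lower EER means the two score distributions overlap less. Evaluation toolkits usually interpolate between thresholds, so this discrete sweep is only an approximation.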
📝 Abstract
The rapid advancement of generative AI has made audio deepfakes increasingly indistinguishable from authentic human vocals, posing significant threats to persons-of-interest (POI) such as public figures. Current detection systems primarily rely on generic, black-box models that fail to capture speaker-specific idiosyncratic traits and lack interpretability. In this paper, we propose Phoneme-based Voice Profiling (PVP), a novel personalized defense framework. By shifting the detection paradigm from macro-utterance analysis to micro-phonetic modeling, PVP captures the unique acoustic distributions underlying a POI's habitual articulatory patterns. Specifically, our framework models speaker-specific phonetic realizations using lightweight Gaussian Mixture Models (GMMs) estimated solely from bona fide reference speech. This design enables data-efficient profiling and robust generalization to previously unseen spoofing attacks without requiring heavy spoof-specific training. Furthermore, we introduce the first large-scale Chinese POI deepfake dataset to benchmark speaker-specific detection. Experimental results demonstrate that PVP significantly outperforms state-of-the-art generic detectors in POI spoofing scenarios, achieving substantial EER reductions while providing fine-grained, phoneme-level interpretability for forensic analysis. Code and data are available at: https://github.com/JunXue-tech/PVP
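The profiling idea in the abstract can be sketched as follows: fit one small GMM per phoneme on bona fide frames only, then score a test utterance by the average log-likelihood of its frames under the matching phoneme models. This is a hedged illustration, not the authors' implementation (their code is at the repository linked above). It assumes frames are already aligned to phoneme labels, uses synthetic 2-D features in place of real acoustic features, and uses a plain-Python diagonal-covariance EM. All function names and settings (`fit_gmm`, `build_profile`, `score_utterance`, `toy_frames`, `k=2`) are hypothetical.

```python
import math
import random


def log_gauss_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at feature vector x."""
    return sum(-0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))


def logsumexp(values):
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values))


def component_logs(x, gmm):
    """Per-component log(weight * density) for one frame."""
    weights, means, variances = gmm
    return [math.log(w) + log_gauss_diag(x, m, v)
            for w, m, v in zip(weights, means, variances)]


def fit_gmm(frames, k=2, iters=25, seed=0, var_floor=1e-3):
    """Fit a k-component diagonal GMM with plain EM (assumed settings)."""
    rng = random.Random(seed)
    dim = len(frames[0])
    gmm = ([1.0 / k] * k,
           [list(f) for f in rng.sample(frames, k)],
           [[1.0] * dim for _ in range(k)])
    for _ in range(iters):
        # E-step: responsibility of each component for each frame.
        resp = []
        for x in frames:
            logs = component_logs(x, gmm)
            norm = logsumexp(logs)
            resp.append([math.exp(l - norm) for l in logs])
        # M-step: re-estimate weights, means, and floored variances.
        weights, means, variances = [], [], []
        for j in range(k):
            nj = sum(r[j] for r in resp) + 1e-10
            mean = [sum(r[j] * x[d] for r, x in zip(resp, frames)) / nj
                    for d in range(dim)]
            var = [max(var_floor,
                       sum(r[j] * (x[d] - mean[d]) ** 2
                           for r, x in zip(resp, frames)) / nj)
                   for d in range(dim)]
            weights.append(nj / len(frames))
            means.append(mean)
            variances.append(var)
        gmm = (weights, means, variances)
    return gmm


def build_profile(frames_by_phoneme, k=2):
    """One GMM per phoneme, estimated from bona fide frames only."""
    return {ph: fit_gmm(frames, k=k) for ph, frames in frames_by_phoneme.items()}


def score_utterance(aligned_frames, profile):
    """Return (overall score, per-phoneme mean log-likelihood)."""
    logliks = {}
    for ph, x in aligned_frames:
        if ph in profile:
            logliks.setdefault(ph, []).append(logsumexp(component_logs(x, profile[ph])))
    if not logliks:
        raise ValueError("no frames matched a profiled phoneme")
    per_phoneme = {ph: sum(v) / len(v) for ph, v in logliks.items()}
    return sum(per_phoneme.values()) / len(per_phoneme), per_phoneme


def toy_frames(center, n, rng, spread=0.5):
    """Synthetic 2-D 'acoustic' frames around a center (illustration only)."""
    return [(rng.gauss(center[0], spread), rng.gauss(center[1], spread))
            for _ in range(n)]


rng = random.Random(1)
profile = build_profile({"a": toy_frames((0, 0), 200, rng),
                         "i": toy_frames((3, 3), 200, rng)})
genuine = ([("a", x) for x in toy_frames((0, 0), 50, rng)]
           + [("i", x) for x in toy_frames((3, 3), 50, rng)])
shifted = ([("a", x) for x in toy_frames((1.5, 1.5), 50, rng)]
           + [("i", x) for x in toy_frames((4.5, 4.5), 50, rng)])
genuine_score, genuine_detail = score_utterance(genuine, profile)
shifted_score, shifted_detail = score_utterance(shifted, profile)
```

In this toy setup, frames drawn from the enrolled distributions score higher than frames whose phoneme realizations are shifted. The per-phoneme dictionary is the kind of output that could support the phoneme-level inspection the abstract describes. No spoofed data is used to build the profile.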
Problem

Research questions and friction points this paper is trying to address.

deepfake detection
speaker-specific
phoneme
voice profiling
spoofing
Innovation

Methods, ideas, or system contributions that make the work stand out.

phoneme-level modeling
speaker-specific profiling
Gaussian Mixture Models
deepfake detection
interpretable AI
Authors
Jun Xue
Key Laboratory of Aerospace Information Security and Trusted Computing, Ministry of Education; School of Cyber Science and Engineering, Wuhan University
Tong Zhang
Professor of GIS/Remote Sensing, Wuhan University
GeoAI, machine learning, transport geography
Zhuolin Yi
Key Laboratory of Aerospace Information Security and Trusted Computing, Ministry of Education; School of Cyber Science and Engineering, Wuhan University
Yihuan Huang
Key Laboratory of Aerospace Information Security and Trusted Computing, Ministry of Education; School of Cyber Science and Engineering, Wuhan University
Yi Chai
Key Laboratory of Aerospace Information Security and Trusted Computing, Ministry of Education; School of Cyber Science and Engineering, Wuhan University
Yiyang Zhang
Postgraduate, University of Science and Technology of China
Large Language Model, Recommender System
Yanzhen Ren
Key Laboratory of Aerospace Information Security and Trusted Computing, Ministry of Education; School of Cyber Science and Engineering, Wuhan University