Phoneme-Level Analysis for Person-of-Interest Speech Deepfake Detection

📅 2025-07-11

📈 Citations: 0

✨ Influential: 0

career value

224K/year

🤖 AI Summary

Deepfake speech detection suffers from insufficient granularity and poor interpretability, particularly in person-of-interest (POI)-specific scenarios. To address this, we propose the first phoneme-level fine-grained detection framework: (1) phoneme segmentation of reference speech to construct a POI-specific phoneme-level speaker embedding model; and (2) phoneme-wise comparison between the test sample and the reference model during inference to localize anomalous synthetic segments. Our method integrates phoneme-aware modeling, fine-grained dissimilarity quantification, and generative artifact analysis, thereby significantly enhancing detection interpretability and cross-attack robustness. Experiments demonstrate competitive accuracy relative to state-of-the-art methods, with a 23.6% reduction in false positive rate. Moreover, the framework enables forensically actionable attribution—pinpointing tampered phonemes—establishing a novel paradigm for multimedia trustworthiness verification.

Technology Category

Application Category

📝 Abstract

Recent advances in generative AI have made the creation of speech deepfakes widely accessible, posing serious challenges to digital trust. To counter this, various speech deepfake detection strategies have been proposed, including Person-of-Interest (POI) approaches, which focus on identifying impersonations of specific individuals by modeling and analyzing their unique vocal traits. Despite their excellent performance, the existing methods offer limited granularity and lack interpretability. In this work, we propose a POI-based speech deepfake detection method that operates at the phoneme level. Our approach decomposes reference audio into phonemes to construct a detailed speaker profile. In inference, phonemes from a test sample are individually compared against this profile, enabling fine-grained detection of synthetic artifacts. The proposed method achieves comparable accuracy to traditional approaches while offering superior robustness and interpretability, key aspects in multimedia forensics. By focusing on phoneme analysis, this work explores a novel direction for explainable, speaker-centric deepfake detection.

Problem

Research questions and friction points this paper is trying to address.

Detect speech deepfakes targeting specific individuals

Improve granularity and interpretability in detection methods

Analyze phoneme-level artifacts for robust speaker verification

Innovation

Methods, ideas, or system contributions that make the work stand out.

Phoneme-level analysis for fine-grained detection

Speaker profile construction from phoneme decomposition

Individual phoneme comparison for synthetic artifact identification

🔎 Similar Papers

A Comprehensive Survey with Critical Analysis for Deepfake Speech Detection