PhonemeFake: Redefining Deepfake Realism with Language-Driven Segmental Manipulation and Adaptive Bilevel Detection

📅 2025-06-28
🤖 AI Summary
Existing deepfake datasets lack perceptual realism and fail to emulate real-world threats effectively. To address this, we propose PhonemeFake, a linguistically driven, phoneme-level speech manipulation framework that uses language reasoning to identify the segments most critical to forge and employs generative adversarial networks (GANs) to synthesize high-fidelity spoofed speech for them. This substantially lowers both human detection rates and the accuracy of state-of-the-art detection models. We further introduce an adaptive bilevel detection architecture that precisely localizes forged regions while keeping binary classification efficient. The detection model is released as open source, and a new benchmark dataset is publicly available on Hugging Face. Experiments across three mainstream deepfake speech datasets show a 91% reduction in equal error rate (EER), up to a 90% inference speed-up, and minimal computational overhead, balancing perceptual realism, detectability, and deployment practicality.
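Since the headline detection result is stated as an equal error rate, the following minimal Python sketch shows how EER is commonly computed from detector scores: sweep a decision threshold and find the point where the false-acceptance rate on spoofed audio equals the false-rejection rate on bona fide audio. The function name `equal_error_rate` and the convention that higher scores mean "more likely bona fide" are assumptions for illustration; this is not the evaluation code released with PhonemeFake.

```python
from typing import Sequence


def equal_error_rate(bonafide: Sequence[float], spoof: Sequence[float]) -> float:
    """Approximate the EER by trying every observed score as a threshold.

    Assumed convention: higher score = more likely bona fide, and a sample
    is accepted as bona fide when its score >= threshold.
    """
    if not bonafide or not spoof:
        raise ValueError("need at least one bona fide and one spoof score")
    thresholds = sorted(set(bonafide) | set(spoof))
    # One extra threshold above every score covers the "reject everything" point.
    thresholds.append(thresholds[-1] + 1.0)

    best_gap, best_eer = float("inf"), 1.0
    for t in thresholds:
        far = sum(s >= t for s in spoof) / len(spoof)        # spoofs wrongly accepted
        frr = sum(b < t for b in bonafide) / len(bonafide)   # bona fide wrongly rejected
        gap = abs(far - frr)
        if gap < best_gap:
            # Report the midpoint where the two error rates are closest.
            best_gap, best_eer = gap, (far + frr) / 2
    return best_eer
```

Perfectly separated scores yield an EER of 0. Read as a relative reduction, cutting EER by 91% means the new EER is 9% of the baseline's.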

📝 Abstract
Deepfake (DF) attacks pose a growing threat as generative models become increasingly advanced. However, our study reveals that existing DF datasets fail to deceive human perception, unlike real DF attacks that influence public discourse. This highlights the need for more realistic DF attack vectors. We introduce PhonemeFake (PF), a DF attack that manipulates critical speech segments using language reasoning, reducing human detection by up to 42% and benchmark detection accuracies by up to 94%. We release an easy-to-use PF dataset on HuggingFace and open-source a bilevel DF segment detection model that adaptively prioritizes compute on manipulated regions. Our extensive experiments across three known DF datasets reveal that our detection model reduces EER by 91% while achieving up to a 90% speed-up with minimal compute overhead, and that it localizes manipulated regions more precisely than existing models, offering a scalable solution.
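The abstract describes a bilevel detector that adaptively concentrates compute on manipulated regions. One generic way to realize that idea is coarse-to-fine screening: a cheap model scores every segment, and only the segments it flags are passed to an expensive model, whose decisions also localize the manipulation. The sketch below assumes that design; the `coarse` and `fine` scorer callables, the fixed-length segmentation, and both thresholds are hypothetical placeholders, not the authors' architecture.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

# Hypothetical scorer type: returns an estimated P(spoof) for one segment.
Scorer = Callable[[Sequence[float]], float]


@dataclass
class BilevelResult:
    is_spoof: bool                # utterance-level decision
    flagged_segments: List[int]   # segment indices localized as manipulated
    fine_calls: int               # how many segments needed the expensive model


def bilevel_detect(
    waveform: Sequence[float],
    segment_len: int,
    coarse: Scorer,
    fine: Scorer,
    coarse_threshold: float = 0.3,
    fine_threshold: float = 0.5,
) -> BilevelResult:
    if segment_len <= 0:
        raise ValueError("segment_len must be positive")
    segments = [waveform[i:i + segment_len] for i in range(0, len(waveform), segment_len)]

    flagged: List[int] = []
    fine_calls = 0
    for idx, seg in enumerate(segments):
        # Level 1: cheap screening runs on every segment.
        if coarse(seg) < coarse_threshold:
            continue
        # Level 2: the expensive model runs only on suspicious segments.
        fine_calls += 1
        if fine(seg) >= fine_threshold:
            flagged.append(idx)
    return BilevelResult(bool(flagged), flagged, fine_calls)
```

Because the fine model runs only on segments that pass the coarse screen, `fine_calls` reflects the compute saved relative to scoring every segment with the expensive model, and `flagged_segments` doubles as a coarse localization.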
Problem

Enhancing deepfake realism using language-driven segment manipulation
Improving detection accuracy with adaptive bilevel segment analysis
Addressing human perception gaps in current deepfake datasets
Innovation

Language-driven segmental manipulation for deepfake realism
Adaptive bilevel detection prioritizing compute efficiency
Open-source dataset and detection model as a scalable solution
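Segmental manipulation replaces only a short, critical span of an utterance rather than synthesizing the whole clip. Assuming a word-level alignment is available and a replacement waveform has already been synthesized for the chosen word (the language-reasoning step that picks the word and the synthesis model are not shown), the splice itself can be sketched as follows. The `Alignment` format and the `splice_word` helper are hypothetical.

```python
from typing import List, Sequence, Tuple

# Hypothetical alignment format: (word, start_seconds, end_seconds).
Alignment = List[Tuple[str, float, float]]


def splice_word(
    waveform: Sequence[float],
    sample_rate: int,
    alignment: Alignment,
    target_word: str,
    replacement: Sequence[float],
) -> List[float]:
    """Replace the first occurrence of target_word's span with `replacement` samples."""
    for word, start_s, end_s in alignment:
        if word == target_word:
            # Convert alignment times to sample indices.
            start = round(start_s * sample_rate)
            end = round(end_s * sample_rate)
            if not 0 <= start <= end <= len(waveform):
                raise ValueError("alignment falls outside the waveform")
            # Keep everything outside the span; the rest of the utterance stays bona fide.
            return list(waveform[:start]) + list(replacement) + list(waveform[end:])
    raise KeyError(f"{target_word!r} not found in alignment")
```

A practical pipeline would likely also smooth the boundaries (for example with a short crossfade) so the splice is not audible; this sketch uses a hard cut for clarity.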
Oguzhan Baser
Department of Electrical and Computer Engineering, The University of Texas at Austin, USA
Ahmet Ege Tanriverdi
Undergraduate Student, Bogazici University
Representation Learning, Deep Learning, Optimization Theory, Statistical Inference
Sriram Vishwanath
MITRE
Information & Coding Theory, Communications/Networking, Blockchains/Crypto, AI/ML/Data Science
Sandeep P. Chinchali
Department of Electrical and Computer Engineering, The University of Texas at Austin, USA