What Counts as Real? Speech Restoration and Voice Quality Conversion Pose New Challenges to Deepfake Detection

📅 2026-03-14
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current audio anti-spoofing systems classify speech as either genuine or spoofed, so they cannot distinguish benign speaker-preserving transformations, such as speech restoration or phonation-modifying voice conversion, from malicious forgeries, and they often reject the former as spoofs. The authors reformulate anti-spoofing as a multi-class problem with four categories: bona fide, converted (benignly transformed), spoofed, and converted-spoofed. Analysing self-supervised learning (SSL) embeddings and acoustic correlates, they show that benign transformations induce a drift in the SSL space that compresses bona fide and spoofed speech together and reduces classifier separability. The multi-class formulation improves robustness to benign transformations while preserving spoof detection, which suggests that binary systems model the distribution of raw speech rather than authenticity itself.
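The four-way categorisation in the summary above can be made concrete with a small sketch. The class names, the posterior-summing rule, and the 0.5 threshold below are illustrative assumptions, not the authors' implementation; the point is only that a benignly transformed sample can be accepted once it has a class of its own.

```python
from enum import Enum


class SpeechClass(Enum):
    """Hypothetical labels for the four categories named in the summary."""
    BONA_FIDE = 0
    CONVERTED = 1            # benign, speaker-preserving transformation
    SPOOFED = 2
    CONVERTED_SPOOFED = 3


def spoof_probability(posteriors):
    """Sum the posterior mass assigned to the two spoofed classes.

    `posteriors` maps each SpeechClass to a probability. Collapsing the
    four classes this way is an assumption made for illustration.
    """
    return (posteriors[SpeechClass.SPOOFED]
            + posteriors[SpeechClass.CONVERTED_SPOOFED])


def is_accepted(posteriors, threshold=0.5):
    """Accept a sample when its total spoof probability is below threshold."""
    return spoof_probability(posteriors) < threshold


# Made-up posteriors for a restored recording of a genuine speaker.
restored_genuine = {
    SpeechClass.BONA_FIDE: 0.10,
    SpeechClass.CONVERTED: 0.80,
    SpeechClass.SPOOFED: 0.05,
    SpeechClass.CONVERTED_SPOOFED: 0.05,
}

# Made-up posteriors for a spoofed sample that was later restored.
restored_spoof = {
    SpeechClass.BONA_FIDE: 0.05,
    SpeechClass.CONVERTED: 0.15,
    SpeechClass.SPOOFED: 0.20,
    SpeechClass.CONVERTED_SPOOFED: 0.60,
}
```

Under these made-up numbers, `is_accepted(restored_genuine)` holds while `is_accepted(restored_spoof)` does not. A binary system has no "converted" class to absorb the first sample, which is the failure mode the summary describes.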

📝 Abstract
Audio anti-spoofing systems are typically formulated as binary classifiers distinguishing bona fide from spoofed speech. This assumption fails under layered generative processing, where benign transformations introduce distributional shifts that are misclassified as spoofing. We show that phonation-modifying voice conversion and speech restoration are treated as out-of-distribution despite preserving speaker authenticity. Using a multi-class setup separating bona fide, converted, spoofed, and converted-spoofed speech, we analyse model behaviour through self-supervised learning (SSL) embeddings and acoustic correlates. The benign transformations induce a drift in the SSL space, compressing bona fide and spoofed speech and reducing classifier separability. Reformulating anti-spoofing as a multi-class problem improves robustness to benign shifts while preserving spoof detection, suggesting binary systems model the distribution of raw speech rather than authenticity itself.
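The drift and compression described in the abstract can be illustrated on synthetic data. The sketch below is not the authors' analysis: the 2-D Gaussian "embeddings", the cluster centres, and the Fisher-style separability ratio are all assumptions chosen to show how a shift can move both classes to a new region while pulling them closer together.

```python
import math
import random


def centroid(points):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(points)
    dim = len(points[0])
    return [sum(p[d] for p in points) / n for d in range(dim)]


def mean_sq_spread(points):
    """Mean squared distance of the points from their own centroid."""
    c = centroid(points)
    return sum(math.dist(p, c) ** 2 for p in points) / len(points)


def separability(a, b):
    """Fisher-style ratio: squared centroid distance over summed spread."""
    between = math.dist(centroid(a), centroid(b)) ** 2
    within = mean_sq_spread(a) + mean_sq_spread(b)
    return between / within


def sample_cloud(rng, center, n=200, sigma=0.5):
    """Draw n points from an isotropic Gaussian around `center`."""
    return [[rng.gauss(c, sigma) for c in center] for _ in range(n)]


rng = random.Random(0)

# Stand-ins for embeddings of unprocessed speech (centres are made up).
bona = sample_cloud(rng, [0.0, 0.0])
spoof = sample_cloud(rng, [3.0, 0.0])

# Stand-ins for the same classes after a benign transformation: both
# clouds move to a new region and end up closer to each other.
bona_conv = sample_cloud(rng, [4.0, 4.0])
spoof_conv = sample_cloud(rng, [5.0, 4.0])

drift = math.dist(centroid(bona), centroid(bona_conv))
sep_raw = separability(bona, spoof)
sep_conv = separability(bona_conv, spoof_conv)

print(f"drift of bona fide centroid: {drift:.2f}")
print(f"separability before: {sep_raw:.2f}, after: {sep_conv:.2f}")
```

In this toy setup the bona fide centroid moves far from its original position while the separability ratio drops, so a decision boundary fitted on unprocessed speech would see the transformed bona fide samples as out-of-distribution.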
Problem

Research questions and friction points this paper is trying to address.

deepfake detection
voice conversion
speech restoration
audio anti-spoofing
distributional shift
Innovation

Methods, ideas, or system contributions that make the work stand out.

deepfake detection
voice conversion
speech restoration
self-supervised learning
multi-class anti-spoofing
Authors
Shree Harsha Bokkahalli Satish
Department of Speech, Music and Hearing, KTH Royal Institute of Technology, Sweden
Harm Lameris
Department of Speech, Music and Hearing, KTH Royal Institute of Technology, Sweden
Joakim Gustafson
Department of Speech, Music and Hearing, KTH Royal Institute of Technology, Sweden
Éva Székely
Assistant Professor, KTH Royal Institute of Technology
speech technology, speech synthesis, deep learning, generative modelling, bias detection