What Counts as Real? Speech Restoration and Voice Quality Conversion Pose New Challenges to Deepfake Detection

📅 2026-03-14

🤖 AI Summary

This work addresses a limitation of current audio anti-spoofing systems, which classify speech only as genuine or spoofed. Such systems cannot distinguish benign, speaker-preserving transformations (such as speech restoration or phonation-modifying voice conversion) from malicious forgeries, so benignly processed speech is often falsely rejected. The authors adopt a multi-class setup with four categories: bona fide, converted (benignly transformed), spoofed, and converted-spoofed. Analysing self-supervised learning (SSL) embeddings and acoustic correlates, they show that benign transformations induce a drift in the SSL embedding space that compresses bona fide and spoofed speech together and reduces classifier separability. Reformulating anti-spoofing as a multi-class problem improves robustness to benign transformations while preserving spoof detection, which suggests that binary systems model the distribution of raw speech rather than authenticity itself.
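The four-way labelling can be illustrated with a minimal sketch. It is not taken from the paper: the names `SpeechClass`, `spoof_probability`, and `is_spoof`, the 0.5 threshold, and the idea of summing the two spoofed posteriors to recover a binary decision are all assumptions made here for illustration.

```python
from enum import IntEnum


class SpeechClass(IntEnum):
    """Hypothetical label scheme for the four categories described above."""
    BONA_FIDE = 0
    CONVERTED = 1          # benign transformation of genuine speech
    SPOOFED = 2
    CONVERTED_SPOOFED = 3  # spoofed speech that was also transformed


# Classes that should still be rejected by a downstream accept/reject gate.
SPOOF_CLASSES = (SpeechClass.SPOOFED, SpeechClass.CONVERTED_SPOOFED)


def spoof_probability(posteriors):
    """Sum the posterior mass assigned to the two spoofed classes.

    `posteriors` is a sequence of four probabilities indexed by SpeechClass.
    """
    if len(posteriors) != len(SpeechClass):
        raise ValueError("expected one posterior per SpeechClass")
    return sum(posteriors[c] for c in SPOOF_CLASSES)


def is_spoof(posteriors, threshold=0.5):
    """Collapse a four-class posterior into a binary spoof decision.

    The threshold is an illustrative assumption, not a value from the paper.
    """
    return spoof_probability(posteriors) >= threshold
```

Under this scheme, a clip whose posterior mass sits mostly on `CONVERTED` is accepted as genuine, whereas a binary system has no class to put it in and must treat it as either raw bona fide speech or a spoof.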

📝 Abstract

Audio anti-spoofing systems are typically formulated as binary classifiers distinguishing bona fide from spoofed speech. This assumption fails under layered generative processing, where benign transformations introduce distributional shifts that are misclassified as spoofing. We show that phonation-modifying voice conversion and speech restoration are treated as out-of-distribution despite preserving speaker authenticity. Using a multi-class setup separating bona fide, converted, spoofed, and converted-spoofed speech, we analyse model behaviour through self-supervised learning (SSL) embeddings and acoustic correlates. The benign transformations induce a drift in the SSL space, compressing bona fide and spoofed speech and reducing classifier separability. Reformulating anti-spoofing as a multi-class problem improves robustness to benign shifts while preserving spoof detection, suggesting binary systems model the distribution of raw speech rather than authenticity itself.
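To make the idea of "compression" in an embedding space concrete, here is a small sketch of one possible separability score: the distance between two class centroids divided by the average within-class spread. This metric, the function names, and the toy 2-D points are assumptions for illustration only; the abstract does not say how separability was measured.

```python
import math


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def mean_spread(vectors, center):
    """Average Euclidean distance of the vectors from a given center."""
    return sum(math.dist(v, center) for v in vectors) / len(vectors)


def separability(class_a, class_b):
    """Centroid distance divided by the mean within-class spread.

    A hypothetical score: larger values mean the two classes are
    easier to tell apart in the embedding space.
    """
    center_a, center_b = centroid(class_a), centroid(class_b)
    spread = 0.5 * (mean_spread(class_a, center_a) + mean_spread(class_b, center_b))
    return math.dist(center_a, center_b) / spread


# Toy 2-D stand-ins for embeddings (made-up numbers, not real data).
bona_fide = [[0.0, 0.0], [0.0, 2.0]]
spoofed = [[4.0, 0.0], [4.0, 2.0]]
# Same spoofed cluster after a drift that pulls it toward bona fide speech.
spoofed_after_drift = [[1.0, 0.0], [1.0, 2.0]]

before = separability(bona_fide, spoofed)
after = separability(bona_fide, spoofed_after_drift)
```

In this toy setup the within-class spread is unchanged, so the drop from `before` to `after` comes entirely from the centroids moving closer together, which is the kind of effect the abstract describes as reduced classifier separability.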

Problem

Research questions and friction points this paper is trying to address.

deepfake detection

voice conversion

speech restoration

audio anti-spoofing

distributional shift

Innovation

Methods, ideas, or system contributions that make the work stand out.

deepfake detection

voice conversion

speech restoration