🤖 AI Summary
Existing speech deepfake detectors perform well on clean audio but degrade sharply on real-world audio in which speech is mixed with background music or noise. To address this limitation, this work introduces MixFake, a large-scale benchmark dataset of mixed audio, and proposes a multi-stream prompt tuning framework that, for the first time, injects signal-level priors (base, frequency, and texture streams) into a self-supervised model, moving beyond the conventional semantic-centric paradigm. The approach achieves an equal error rate (EER) of 0.95% on foreground forgery detection and a 7.72% absolute improvement over baseline methods on complex background detection, indicating better generalization under non-ideal, real-world listening conditions.
📝 Abstract
Speech deepfake detection has achieved remarkable success in clean environments but faces significant challenges in complex, real-world scenarios where speech is often mixed with background music or noise. Current state-of-the-art methods rely on semantic features from self-supervised learning (SSL) models, which often fail when processing non-speech or mixed-source audio. In this paper, we first introduce MixFake, a large-scale benchmark dataset designed to simulate diverse acoustic environments with varying signal-to-noise ratio (SNR) levels and mixed authenticity components. To address the "semantic-centric" limitation, we propose a Multi-stream Prompt Tuning framework that injects signal-level priors into SSL backbones. By integrating base, frequency, and texture streams through deep prompt injection, our model effectively captures acoustic artifacts. Experimental results demonstrate that our method significantly outperforms existing baselines, achieving a 0.95% equal error rate (EER) in foreground detection and a substantial 7.72% absolute improvement in complex background detection tasks. Our dataset and code are available at https://github.com/saltfish233/MixFake.
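
To make the idea of "varying SNR levels" concrete, here is a minimal sketch of mixing a speech signal with a background track at a target signal-to-noise ratio. The abstract does not describe how MixFake constructs its mixtures, so this is a generic illustration, not the authors' pipeline; the function name `mix_at_snr` and the choice to loop a short background are assumptions.

```python
import numpy as np


def mix_at_snr(speech, background, snr_db):
    """Add `background` to `speech` so the speech-to-background power ratio is `snr_db` dB.

    Hypothetical helper for illustration; not taken from the MixFake code base.
    """
    speech = np.asarray(speech, dtype=np.float64)
    background = np.asarray(background, dtype=np.float64)

    # Assumption: loop a short background, then trim it to the speech length.
    if len(background) < len(speech):
        repeats = int(np.ceil(len(speech) / len(background)))
        background = np.tile(background, repeats)
    background = background[: len(speech)]

    speech_power = np.mean(speech ** 2)
    background_power = np.mean(background ** 2)
    if speech_power == 0 or background_power == 0:
        raise ValueError("speech and background must both be non-silent")

    # SNR_dB = 10 * log10(P_speech / (gain^2 * P_background)), solved for gain.
    gain = np.sqrt(speech_power / (background_power * 10 ** (snr_db / 10)))
    return speech + gain * background


# Usage: synthetic stand-ins for a speech clip and a shorter background clip.
rng = np.random.default_rng(0)
speech = rng.standard_normal(16000)
music = rng.standard_normal(4000)
mixture = mix_at_snr(speech, music, snr_db=5.0)
```

Lower `snr_db` values make the background louder relative to the speech, which is the regime in which the abstract reports that semantic SSL features tend to fail.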