MixFake: Benchmarking and Enhancing Audio Deepfake Detection in Diverse Real-world Mixed Audio

📅 2026-05-21
📈 Citations: 0
Influential: 0

🤖 AI Summary
Existing speech deepfake detectors perform well on clean audio but degrade markedly on real-world mixed audio that contains background music or noise. To address this, the work introduces MixFake, a large-scale benchmark dataset of mixed audio with varying SNR levels and mixed authenticity components, and proposes a Multi-stream Prompt Tuning framework that injects signal-level priors (base, frequency, and texture streams) into a self-supervised learning (SSL) backbone through deep prompt injection, moving beyond the semantics-centric paradigm. The approach achieves a 0.95% equal error rate (EER) on foreground detection and a 7.72% absolute improvement over baselines on complex background detection tasks.
📝 Abstract
Speech deepfake detection has achieved remarkable success in clean environments but faces significant challenges in complex, real-world scenarios where speech is often mixed with background music or noise. Current state-of-the-art methods rely on semantic features from self-supervised learning (SSL) models, which often fail when processing non-speech or mixed-source audio. In this paper, we first introduce MixFake, a large-scale benchmark dataset designed to simulate diverse acoustic environments with varying SNR levels and mixed authenticity components. To address the "semantic-centric" limitation, we propose a Multi-stream Prompt Tuning framework that injects signal-level priors into SSL backbones. By integrating base, frequency, and texture streams through deep prompt injection, our model effectively captures acoustic artifacts. Experimental results demonstrate that our method significantly outperforms existing baselines, achieving a 0.95% EER in foreground detection and a substantial 7.72% absolute improvement in complex background detection tasks. Our dataset and code are available at https://github.com/saltfish233/MixFake.
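The abstract mentions mixing speech with background audio at varying SNR levels. The following is a minimal sketch of how a mixture at a target SNR could be produced. It is not MixFake's actual construction pipeline, which is not described here; the function names `mix_at_snr` and `snr_db` and the synthetic test signals are assumptions made for illustration.

```python
import math
import random


def snr_db(signal, noise):
    """Return the signal-to-noise ratio in dB from mean sample power."""
    signal_power = sum(s * s for s in signal) / len(signal)
    noise_power = sum(n * n for n in noise) / len(noise)
    return 10.0 * math.log10(signal_power / noise_power)


def mix_at_snr(speech, background, target_snr_db):
    """Scale `background` so speech/background power matches the target SNR, then add.

    Hypothetical helper: returns the mixture and the gain applied to the background.
    """
    if len(speech) != len(background):
        raise ValueError("speech and background must have the same length")
    speech_power = sum(s * s for s in speech) / len(speech)
    background_power = sum(b * b for b in background) / len(background)
    if speech_power == 0 or background_power == 0:
        raise ValueError("both signals must have non-zero power")
    # SNR_dB = 10 * log10(P_speech / (gain^2 * P_background)); solve for gain.
    gain = math.sqrt(speech_power / (background_power * 10 ** (target_snr_db / 10)))
    mixed = [s + gain * b for s, b in zip(speech, background)]
    return mixed, gain


if __name__ == "__main__":
    rng = random.Random(0)
    sample_rate = 16000
    # Stand-ins for real audio: a 220 Hz tone as "speech", Gaussian noise as background.
    speech = [math.sin(2 * math.pi * 220 * n / sample_rate) for n in range(sample_rate)]
    background = [rng.gauss(0.0, 1.0) for _ in range(sample_rate)]
    mixed, gain = mix_at_snr(speech, background, target_snr_db=5.0)
    scaled_background = [gain * b for b in background]
    print(f"achieved SNR: {snr_db(speech, scaled_background):.2f} dB")
```

Scaling only the background leaves the foreground speech level unchanged, which is one common convention. A real pipeline would also need to handle length mismatches, clipping, and resampling.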
Problem

Research questions and friction points this paper is trying to address.

audio deepfake detection
mixed audio
real-world scenarios
background noise
speech authenticity
Innovation

Methods, ideas, or system contributions that make the work stand out.

audio deepfake detection
multi-stream prompt tuning
self-supervised learning
mixed audio
signal-level priors
Authors
Qingcao Li
School of Cyber Science and Engineering, Nanjing University of Science and Technology, Nanjing, China
Yipeng Lin
The State Key Laboratory of Blockchain and Data Security, Zhejiang University, Hangzhou, China
Weichen Lian
The State Key Laboratory of Blockchain and Data Security, Zhejiang University, Hangzhou, China
Zhongjie Ba
Zhejiang University
IoT security
Peng Cheng
Zhejiang University
IoT, Acoustic Security and Privacy, Digital Signal Processing
Zhichao Lian
School of Cyber Science and Engineering, Nanjing University of Science and Technology, Nanjing, China