MixFake: Benchmarking and Enhancing Audio Deepfake Detection in Diverse Real-world Mixed Audio

📅 2026-05-21

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing speech deepfake detection methods perform well on clean audio but degrade significantly on real-world mixed audio that contains background music or noise. To address this, the work introduces MixFake, a large-scale benchmark dataset of mixed audio with varying SNR levels and mixed authenticity components, and proposes a Multi-stream Prompt Tuning framework that injects signal-level priors (base, frequency, and texture streams) into a self-supervised learning (SSL) backbone through deep prompt injection, moving beyond the conventional semantic-centric paradigm. The approach achieves an equal error rate (EER) of 0.95% on foreground detection and a 7.72% absolute improvement over existing baselines on complex background detection tasks.
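For readers unfamiliar with the metric, the EER quoted above is the operating point at which the false acceptance rate (spoofed audio accepted as genuine) equals the false rejection rate (genuine audio rejected). The following is a minimal sketch of how it is commonly computed, not the paper's evaluation script. The function name `compute_eer` and the label convention (1 = bona fide, 0 = spoof, higher score = more likely bona fide) are assumptions made for illustration.

```python
import numpy as np


def compute_eer(scores, labels):
    """Return (eer, threshold) for a set of detection scores.

    Assumed convention (not taken from the paper):
    labels are 1 for bona fide and 0 for spoof, and a higher
    score means the sample looks more bona fide.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)

    bona_fide = scores[labels == 1]
    spoof = scores[labels == 0]

    # Candidate thresholds: every observed score, plus one above the
    # maximum so that "reject everything" is also considered.
    thresholds = np.unique(scores)
    thresholds = np.append(thresholds, thresholds[-1] + 1.0)

    # A sample is accepted as bona fide when score >= threshold.
    far = np.array([(spoof >= t).mean() for t in thresholds])      # spoof accepted
    frr = np.array([(bona_fide < t).mean() for t in thresholds])   # bona fide rejected

    # The EER is taken where the two error rates are closest.
    idx = int(np.argmin(np.abs(far - frr)))
    eer = (far[idx] + frr[idx]) / 2.0
    return eer, thresholds[idx]


if __name__ == "__main__":
    demo_scores = [0.9, 0.4, 0.6, 0.1]
    demo_labels = [1, 1, 0, 0]
    eer, thr = compute_eer(demo_scores, demo_labels)
    print(f"EER = {eer:.2%} at threshold {thr}")
```

This sketch sweeps thresholds over the observed scores only, so on small score sets the result is an approximation; evaluation toolkits often interpolate between thresholds instead.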

📝 Abstract

Speech deepfake detection has achieved remarkable success in clean environments but faces significant challenges in complex, real-world scenarios where speech is often mixed with background music or noise. Current state-of-the-art methods rely on semantic features from self-supervised learning (SSL) models, which often fail when processing non-speech or mixed-source audio. In this paper, we first introduce MixFake, a large-scale benchmark dataset designed to simulate diverse acoustic environments with varying SNR levels and mixed authenticity components. To address the "semantic-centric" limitation, we propose a Multi-stream Prompt Tuning framework that injects signal-level priors into SSL backbones. By integrating base, frequency, and texture streams through deep prompt injection, our model effectively captures acoustic artifacts. Experimental results demonstrate that our method significantly outperforms existing baselines, achieving a 0.95% EER in foreground detection and a substantial 7.72% absolute improvement in complex background detection tasks. Our dataset and code are available at https://github.com/saltfish233/MixFake.
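The abstract does not spell out how MixFake combines speech with background audio, so the following is only a sketch of the standard power-based way to mix two signals at a target SNR. The function name `mix_at_snr`, the looping of a short background clip, and the use of mean power over the whole clip are all assumptions for illustration and are not taken from the dataset's code.

```python
import numpy as np


def mix_at_snr(speech, background, snr_db):
    """Mix `background` into `speech` at the requested SNR in dB.

    Hypothetical helper, not MixFake's actual pipeline.
    Returns (mixture, gain), where `gain` is the factor applied
    to the background before adding it to the speech.
    """
    speech = np.asarray(speech, dtype=np.float64)
    background = np.asarray(background, dtype=np.float64)

    # Loop the background if it is shorter than the speech, then trim.
    if len(background) < len(speech):
        reps = int(np.ceil(len(speech) / len(background)))
        background = np.tile(background, reps)
    background = background[: len(speech)]

    p_speech = np.mean(speech ** 2)
    p_background = np.mean(background ** 2)
    if p_background == 0.0:
        raise ValueError("background is silent; SNR is undefined")

    # SNR_dB = 10 * log10(P_speech / (gain^2 * P_background))
    # => gain = sqrt(P_speech / (P_background * 10^(SNR_dB / 10)))
    gain = np.sqrt(p_speech / (p_background * 10.0 ** (snr_db / 10.0)))
    return speech + gain * background, gain


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    speech = rng.standard_normal(16000)        # stand-in for 1 s of speech at 16 kHz
    music = 0.3 * rng.standard_normal(4000)    # stand-in for a short background clip
    mixture, gain = mix_at_snr(speech, music, snr_db=5.0)
    achieved = 10.0 * np.log10(np.mean(speech ** 2) / np.mean((mixture - speech) ** 2))
    print(f"background gain = {gain:.3f}, achieved SNR = {achieved:.2f} dB")
```

Lower `snr_db` values make the background louder relative to the speech, which is the regime where the abstract reports that detectors built on semantic features struggle.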

Problem

Research questions and friction points this paper is trying to address.

audio deepfake detection

mixed audio

real-world scenarios

background noise

speech authenticity

Innovation

Methods, ideas, or system contributions that make the work stand out.

audio deepfake detection

multi-stream prompt tuning

self-supervised learning

mixed audio

signal-level priors
