🤖 AI Summary
To address the poor cross-type generalization of deepfake audio detection—spanning speech, environmental sounds, singing, and music—this paper introduces the first comprehensive benchmark for multi-type deepfake audio detection. We propose wavelet prompt tuning (WPT), a frequency-domain-aware lightweight adaptation method that integrates wavelet analysis with prompt learning to capture type-invariant auditory forgery cues without introducing trainable parameters beyond the prompt tokens themselves. Built on the XLSR self-supervised backbone and the AASIST detection architecture, WPT applies the discrete wavelet transform to enhance the spectral representation and trains only a small set of learnable prompts jointly across all audio types. Evaluated on the full multi-type benchmark, WPT-XLSR-AASIST achieves a mean equal error rate (EER) of 3.58%, outperforming full-parameter fine-tuning while using only 0.22% (1/458) of its trainable parameters—a substantial gain in both cross-type generalization and deployment efficiency.
📝 Abstract
The rapid advancement of audio generation technologies has escalated the risks of malicious deepfake audio across speech, sound, singing voice, and music, threatening multimedia security and trust. While existing countermeasures (CMs) perform well in single-type audio deepfake detection (ADD), their performance declines in cross-type scenarios. This paper is dedicated to studying the all-type ADD task. We are the first to comprehensively establish an all-type ADD benchmark to evaluate current CMs, incorporating cross-type deepfake detection across speech, sound, singing voice, and music. Then, we introduce the prompt tuning self-supervised learning (PT-SSL) training paradigm, which optimizes the SSL front-end by learning specialized prompt tokens for ADD, requiring 458x fewer trainable parameters than fine-tuning (FT). Considering the auditory perception of different audio types, we propose the wavelet prompt tuning (WPT)-SSL method to capture type-invariant auditory deepfake information from the frequency domain without requiring additional training parameters, thereby surpassing FT on the all-type ADD task. To achieve a universal CM, we co-train on all types of deepfake audio. Experimental results demonstrate that WPT-XLSR-AASIST achieves the best performance, with an average EER of 3.58% across all evaluation sets. The code is available online.
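The core idea—enhancing frozen SSL features with a discrete wavelet transform and training only a few prepended prompt tokens—can be illustrated with a minimal NumPy sketch. This is a conceptual toy, not the authors' implementation: the single-level Haar wavelet, the feature shapes, and the function names (`haar_dwt`, `wavelet_prompted_input`) are illustrative assumptions; the actual paper operates inside the XLSR/AASIST pipeline.

```python
import numpy as np

def haar_dwt(x):
    """Single-level Haar DWT along the time axis (axis 0).
    Splits the feature sequence into a low-frequency (approximation)
    and a high-frequency (detail) component. (Illustrative; the paper
    may use a different wavelet or decomposition depth.)"""
    even, odd = x[0::2], x[1::2]
    approx = (even + odd) / np.sqrt(2.0)
    detail = (even - odd) / np.sqrt(2.0)
    return approx, detail

def wavelet_prompted_input(ssl_features, prompts):
    """Prepend learnable prompt tokens to wavelet-enhanced SSL features.
    In prompt tuning, only `prompts` would receive gradients; the SSL
    backbone producing `ssl_features` stays frozen."""
    approx, detail = haar_dwt(ssl_features)
    # Concatenating both sub-bands keeps the sequence length unchanged,
    # so the wavelet step adds no trainable parameters.
    enhanced = np.concatenate([approx, detail], axis=0)
    return np.concatenate([prompts, enhanced], axis=0)

# Toy example: 10 frames of 8-dim frozen SSL features, 2 prompt tokens.
feats = np.random.randn(10, 8)
prompts = np.zeros((2, 8))  # learnable in a real training loop
out = wavelet_prompted_input(feats, prompts)
print(out.shape)  # (12, 8): 2 prompts + 10 wavelet-enhanced frames
```

Because the Haar transform is orthonormal, the enhanced frames preserve the energy of the original features—the wavelet step reorganizes spectral information rather than discarding it, which is consistent with the paper's claim of adding frequency-domain awareness at zero parameter cost.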