🤖 AI Summary
Current front-end approaches for deepfake speech detection rely on full fine-tuning of large pre-trained models (e.g., XLSR), which is parameter-inefficient and generalizes poorly. To address this, we propose a parameter-efficient, learnable wavelet-domain sparse prompt tuning method: the first to integrate wavelet transforms with prompt tuning, injecting multi-resolution wavelet features into prompt embeddings to precisely localize synthetic artifacts while keeping the pre-trained model frozen. We design a Partial-WSPT-XLSR front-end coupled with a bidirectional Mamba back-end, enabling multi-scale feature extraction and highly efficient adaptation. Our method trains only a minimal number of parameters (less than 0.1% of the base model) and achieves state-of-the-art performance on the newly released Deepfake-Eval-2024 and SpoofCeleb benchmarks. Crucially, it demonstrates significantly improved generalization to real-world scenarios, outperforming prior methods by substantial margins in both accuracy and robustness.
📝 Abstract
Modern front-end design for speech deepfake detection relies on full fine-tuning of large pre-trained models like XLSR. However, this approach is not parameter-efficient and may generalize poorly to realistic, in-the-wild data. To address these limitations, we introduce a new family of parameter-efficient front-ends that fuse prompt tuning with classical signal processing transforms. These include FourierPT-XLSR, which uses the Fourier Transform, and two variants based on the Wavelet Transform: WSPT-XLSR and Partial-WSPT-XLSR. We further propose WaveSP-Net, a novel architecture combining a Partial-WSPT-XLSR front-end with a bidirectional Mamba-based back-end. This design injects multi-resolution features into the prompt embeddings, which enhances the localization of subtle synthetic artifacts without altering the frozen XLSR parameters. Experimental results demonstrate that WaveSP-Net outperforms several state-of-the-art models on two new and challenging benchmarks, Deepfake-Eval-2024 and SpoofCeleb, with few trainable parameters and notable performance gains. The code and models are available at https://github.com/xxuan-acoustics/WaveSP-Net.
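The core wavelet-domain sparse-prompt idea can be sketched in a few lines: decompose a frame-level feature sequence with a discrete wavelet transform, add a small learnable prompt at a sparse set of positions in the high-frequency band (where subtle synthetic artifacts tend to concentrate), and reconstruct. The following is a minimal NumPy illustration using a single-level Haar DWT; all function names, shapes, and the choice of band are assumptions for exposition, not the authors' implementation:

```python
import numpy as np

def haar_dwt(x):
    """Single-level Haar DWT along the time axis.
    x: (T, D) feature sequence with even T.
    Returns approximation and detail coefficients, each (T//2, D)."""
    even, odd = x[0::2], x[1::2]
    approx = (even + odd) / np.sqrt(2)   # low-frequency band
    detail = (even - odd) / np.sqrt(2)   # high-frequency band
    return approx, detail

def haar_idwt(approx, detail):
    """Inverse of haar_dwt: exact reconstruction of the original sequence."""
    T, D = approx.shape[0] * 2, approx.shape[1]
    x = np.empty((T, D))
    x[0::2] = (approx + detail) / np.sqrt(2)
    x[1::2] = (approx - detail) / np.sqrt(2)
    return x

def wavelet_sparse_prompt(x, prompt, mask):
    """Inject a sparse learnable prompt into the detail band, then reconstruct.
    prompt: (T//2, D) learnable parameters (trained; backbone stays frozen).
    mask:   (T//2, 1) 0/1 sparsity pattern selecting which positions to modify."""
    approx, detail = haar_dwt(x)
    detail = detail + mask * prompt   # sparse injection in the wavelet domain
    return haar_idwt(approx, detail)

rng = np.random.default_rng(0)
T, D = 8, 4
x = rng.standard_normal((T, D))                          # stand-in for XLSR features
prompt = rng.standard_normal((T // 2, D))                # "learnable" prompt
mask = (rng.random((T // 2, 1)) < 0.25).astype(x.dtype)  # ~25% of positions active

y = wavelet_sparse_prompt(x, prompt, mask)
```

With a zero prompt the transform round-trips exactly, so the injection is a strictly additive, localized perturbation of the frozen features; in the actual method, only the prompt parameters receive gradients, which is how the trainable-parameter count stays below 0.1% of the backbone.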