WST-X Series: Wavelet Scattering Transform for Interpretable Speech Deepfake Detection

📅 2026-02-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the longstanding trade-off between interpretability and high-order semantic modeling in deepfake speech detection. Traditional handcrafted features offer transparency but limited performance, while self-supervised representations achieve strong accuracy yet lack interpretability. To bridge this gap, we introduce the wavelet scattering transform (WST) to this domain for the first time and propose the WST-X family of feature extractors. By leveraging 1D and 2D WSTs, our approach captures both fine-grained acoustic details and higher-order structural anomalies, ensuring translation invariance and deformation stability while enabling interpretable modeling of forgery artifacts. Through systematic control of the averaging scale (J) and the frequency and directional resolutions (Q, L), our method significantly outperforms existing front-end approaches on the Deepfake-Eval-2024 benchmark. The analysis shows that configurations with small averaging scales and high resolutions are crucial for detection performance, yielding a unified balance of high accuracy and strong interpretability.

📝 Abstract
Front-end design for speech deepfake detectors falls primarily into two categories. Hand-crafted filterbank features are transparent but limited in capturing high-level semantic details, often leaving a performance gap relative to self-supervised (SSL) features. SSL features, in turn, lack interpretability and may overlook fine-grained spectral anomalies. We propose the WST-X series, a novel family of feature extractors that combines the best of both worlds via the wavelet scattering transform (WST), which cascades wavelets with nonlinearities in a manner analogous to deep convolutional networks. We investigate 1D and 2D WSTs to extract acoustic details and higher-order structural anomalies, respectively. Experimental results on the recent and challenging Deepfake-Eval-2024 dataset indicate that WST-X outperforms existing front-ends by a wide margin. Our analysis reveals that a small averaging scale ($J$), combined with high frequency and directional resolutions ($Q$, $L$), is critical for capturing subtle artifacts. This underscores the value of translation-invariant and deformation-stable features for robust and interpretable speech deepfake detection.
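To make the roles of $J$ and $Q$ concrete, here is a minimal sketch of a zeroth- plus first-order 1D scattering transform in numpy. It is not the paper's WST-X implementation: the filters are simplified Gaussian bumps in the Fourier domain (a stand-in for the Morlet wavelets used by scattering toolkits such as Kymatio), and all names (`first_order_scattering`, `lowpass_subsample`) are illustrative. It does, however, show the cascade the abstract describes: a constant-Q wavelet filter bank, a pointwise modulus nonlinearity, and averaging over $2^J$ samples that yields (approximate) translation invariance.

```python
import numpy as np

def first_order_scattering(x, J=6, Q=8):
    """Sketch of zeroth- and first-order 1D scattering coefficients.

    J : averaging scale -- coefficients are smoothed and subsampled by 2**J.
    Q : wavelets per octave -- higher Q means finer frequency resolution.
    """
    T = len(x)
    omega = np.fft.fftfreq(T)            # frequencies in cycles/sample
    X = np.fft.fft(x)

    # Low-pass phi_J: Gaussian of bandwidth ~2^-J, i.e. averaging over ~2^J samples
    sigma_low = 0.5 * 2.0 ** (-J)
    phi = np.exp(-(omega ** 2) / (2 * sigma_low ** 2))

    def lowpass_subsample(F):
        # Smooth a Fourier-domain signal with phi_J, then subsample by 2**J.
        return np.real(np.fft.ifft(F * phi))[:: 2 ** J]

    coeffs = [lowpass_subsample(X)]      # zeroth order: x convolved with phi_J

    # Constant-Q filter bank: bandwidth proportional to center frequency xi,
    # sweeping from xi = 0.35 down to roughly the low-pass cutoff 2^-J.
    xi = 0.35
    while xi > 2.0 ** (-J):
        sigma = xi / Q
        psi = np.exp(-((omega - xi) ** 2) / (2 * sigma ** 2))
        U = np.abs(np.fft.ifft(X * psi))             # modulus nonlinearity
        coeffs.append(lowpass_subsample(np.fft.fft(U)))  # first-order coefficient
        xi *= 2.0 ** (-1.0 / Q)
    return np.stack(coeffs)              # shape: (n_paths, T // 2**J)

# Usage: a 1024-sample signal with J=6 gives time resolution 1024 // 2**6 = 16.
rng = np.random.default_rng(0)
x = rng.standard_normal(1024)
S = first_order_scattering(x, J=6, Q=8)
```

A small $J$ keeps the averaging window short (here $2^6 = 64$ samples), and a large $Q$ packs more wavelets per octave; the abstract's finding is that exactly this regime, fine temporal averaging with high resolution, best exposes subtle forgery artifacts. Because every operation is a fixed filter or a modulus, each output channel is directly attributable to a frequency band, which is the source of the interpretability claim.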
Problem

Research questions and friction points this paper is trying to address.

speech deepfake detection
interpretability
feature extraction
wavelet scattering transform
spectral anomalies
Innovation

Methods, ideas, or system contributions that make the work stand out.

Wavelet Scattering Transform
Interpretable Deepfake Detection
Speech Forgery Analysis
Translation-Invariant Features
Directional Spectral Anomalies