Attention-based Mixture of Experts for Robust Speech Deepfake Detection

📅 2025-09-22
🤖 AI Summary
The proliferation of AI-generated deepfake speech poses a critical security threat, as such synthetic speech becomes increasingly indistinguishable from genuine human speech. Method: This paper proposes an attention-gated Mixture-of-Experts (MoE) detection framework that integrates multiple heterogeneous speech forgery detectors. A learnable attention mechanism dynamically weights expert outputs to adaptively focus on discriminative features characteristic of diverse spoofing artifacts; additionally, structured inductive biases are incorporated to enhance expert specialization and inter-expert synergy. Contribution/Results: The method achieves first place across all tasks in the SAFE Challenge and significantly outperforms existing state-of-the-art approaches on multiple benchmark datasets—including ASVspoof2019 and In-The-Wild—demonstrating superior generalization capability and robustness against domain shifts and unseen attack types.

📝 Abstract
AI-generated speech is becoming increasingly used in everyday life, powering virtual assistants, accessibility tools, and other applications. However, it is also being exploited for malicious purposes such as impersonation, misinformation, and biometric spoofing. As speech deepfakes become nearly indistinguishable from real human speech, the need for robust detection methods and effective countermeasures has become critically urgent. In this paper, we present the ISPL's submission to the SAFE challenge at IH&MMSec 2025, where our system ranked first across all tasks. Our solution introduces a novel approach to audio deepfake detection based on a Mixture of Experts architecture. The proposed system leverages multiple state-of-the-art detectors, combining their outputs through an attention-based gating network that dynamically weights each expert based on the input speech signal. In this design, each expert develops a specialized understanding of the shared training data by learning to capture different complementary aspects of the same input through inductive biases. Experimental results indicate that our method outperforms existing approaches across multiple datasets. We further evaluate and analyze the performance of our system in the SAFE challenge.
Problem

Research questions and friction points this paper is trying to address.

Detecting AI-generated speech deepfakes used for malicious purposes such as impersonation and misinformation
Developing detection methods that remain robust against biometric spoofing, domain shifts, and unseen attack types
Building an attention-based system that distinguishes synthetic speech from genuine human speech
Innovation

Methods, ideas, or system contributions that make the work stand out.

Mixture of Experts architecture for detection
Attention-based gating network for dynamic weighting
Multiple specialized experts capture complementary aspects
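The fusion idea above can be sketched in a few lines of PyTorch: an input-conditioned gating network produces softmax attention weights over the experts, and the final detection score is the weighted sum of the expert outputs. This is a minimal illustrative sketch, not the paper's implementation; here each expert is a stand-in MLP over a shared feature vector, whereas the paper combines heterogeneous state-of-the-art detectors, and all class/parameter names are hypothetical.

```python
import torch
import torch.nn as nn

class AttentionGatedMoE(nn.Module):
    """Toy attention-gated Mixture of Experts for spoof detection.

    Each 'expert' here is a small MLP over a shared feature vector;
    in the paper the experts are heterogeneous pretrained detectors.
    """
    def __init__(self, feat_dim: int, num_experts: int, hidden: int = 32):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(feat_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))
            for _ in range(num_experts)
        )
        # Gating network: maps input features to one attention logit per expert.
        self.gate = nn.Linear(feat_dim, num_experts)

    def forward(self, x: torch.Tensor):
        # Per-expert spoof logits, shape (batch, num_experts)
        expert_logits = torch.cat([e(x) for e in self.experts], dim=-1)
        # Input-dependent attention weights over experts (sum to 1 per sample)
        weights = torch.softmax(self.gate(x), dim=-1)
        # Weighted fusion into a single detection logit per sample
        fused = (weights * expert_logits).sum(dim=-1)
        return fused, weights

# Usage: fuse three hypothetical experts over 64-dim features
model = AttentionGatedMoE(feat_dim=64, num_experts=3)
x = torch.randn(8, 64)
score, w = model(x)
```

Because the gating weights depend on the input signal, different utterances can lean on different experts, which is how the system adapts to diverse spoofing artifacts.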