🤖 AI Summary
To address the need for real-time deepfake speech detection, this work proposes an efficient architecture that replaces self-attention with bidirectional Mamba blocks. Methodologically, it employs XLSR-Wav2Vec as the front-end acoustic representation and introduces three novel bidirectional Mamba-based encoders (TransBiMamba, ConBiMamba, and PN-BiMamba) that jointly capture local fine-grained artifacts and global contextual cues, model long-range temporal dependencies, and keep inference latency low. Experiments on the ASVspoof2021 Logical Access (LA), DeepFake (DF), and In-The-Wild benchmarks yield EERs of 0.97%, 1.74%, and 5.85%, respectively, substantially outperforming state-of-the-art models including XLSR-Conformer and XLSR-Mamba. The proposed approach combines high accuracy, strong generalization across diverse spoofing attacks and recording conditions, and practical deployability in real-time scenarios.
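To build intuition for the "bidirectional Mamba" idea above, here is a deliberately simplified sketch. A real Mamba block uses an input-dependent selective state-space scan with learned parameters; below, a fixed exponential-decay recurrence stands in for that scan purely to illustrate how a forward and a backward pass are combined so every frame sees both past and future context. The function names (`causal_scan`, `bidirectional_scan`) and the decay constant are illustrative assumptions, not taken from the paper's code.

```python
def causal_scan(xs, decay=0.5):
    """Left-to-right recurrence h_t = decay * h_{t-1} + x_t
    (a toy stand-in for a selective state-space scan)."""
    h, out = 0.0, []
    for x in xs:
        h = decay * h + x
        out.append(h)
    return out

def bidirectional_scan(xs, decay=0.5):
    """Run the scan in both directions and sum the results, so each
    output position aggregates both preceding and following frames."""
    fwd = causal_scan(xs, decay)
    bwd = causal_scan(xs[::-1], decay)[::-1]
    return [f + b for f, b in zip(fwd, bwd)]

feats = [1.0, 0.0, 0.0, 2.0]  # toy frame-level features
print(bidirectional_scan(feats))  # → [2.25, 1.0, 1.25, 4.125]
```

Note how the first output (2.25) already reflects the large value at the end of the sequence, which a purely causal scan would never see; this global receptive field at linear cost is what motivates replacing self-attention.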
📝 Abstract
Advances in speech synthesis intensify security threats, motivating real-time deepfake detection research. We investigate whether bidirectional Mamba can serve as a competitive alternative to Self-Attention in detecting synthetic speech. Our solution, Fake-Mamba, integrates an XLSR front-end with bidirectional Mamba to capture both local and global artifacts. Our core innovation introduces three efficient encoders: TransBiMamba, ConBiMamba, and PN-BiMamba. Leveraging XLSR's rich linguistic representations, PN-BiMamba can effectively capture the subtle cues of synthetic speech. Evaluated on ASVspoof 21 LA, 21 DF, and In-The-Wild benchmarks, Fake-Mamba achieves 0.97%, 1.74%, and 5.85% EER, respectively, representing substantial relative gains over SOTA models XLSR-Conformer and XLSR-Mamba. The framework maintains real-time inference across utterance lengths, demonstrating strong generalization and practical viability. The code is available at https://github.com/xuanxixi/Fake-Mamba.
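The EER figures quoted above are Equal Error Rates: the operating point where the false-acceptance rate (spoof accepted as bona fide) equals the false-rejection rate (bona fide rejected). As a rough illustration, the toy function below sweeps score thresholds and returns the average of the two rates where they cross; official ASVspoof evaluations use the organizers' toolkit, and the scores and function name here are illustrative assumptions.

```python
def eer(bonafide_scores, spoof_scores):
    """Approximate EER: sweep thresholds over all observed scores and
    return the mean of FAR and FRR at the point where they are closest."""
    thresholds = sorted(bonafide_scores + spoof_scores)
    best_gap, best_eer = 1.0, 1.0
    for t in thresholds:
        # spoof scored >= t is falsely accepted; bona fide < t is falsely rejected
        far = sum(s >= t for s in spoof_scores) / len(spoof_scores)
        frr = sum(s < t for s in bonafide_scores) / len(bonafide_scores)
        if abs(far - frr) < best_gap:
            best_gap, best_eer = abs(far - frr), (far + frr) / 2
    return best_eer

# Toy detector scores (higher = more likely bona fide), not real system output.
print(eer([0.9, 0.8, 0.7, 0.4], [0.5, 0.3, 0.2, 0.1]))  # → 0.25
```

A lower EER means the detector separates bona fide and spoofed speech more cleanly, which is why the 0.97% / 1.74% / 5.85% results above indicate improvements over the XLSR-Conformer and XLSR-Mamba baselines.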