🤖 AI Summary
Long temporal dependencies and strong semantic correlations in speech signals pose challenges for efficient contextual modeling, motivating the exploration of alternatives to Transformer self-attention. Method: This work investigates the applicability of selective state space models (SSMs), specifically Mamba, to speech processing. It proposes Bidirectional Mamba (BiMamba), a novel architecture that enhances context capture via bidirectional state propagation, and integrates it into end-to-end automatic speech recognition (ASR) and speech enhancement frameworks, accompanied by ablation-driven architectural adaptation strategies. Contribution/Results: This is the first systematic empirical validation of Mamba-style SSMs in speech tasks. BiMamba consistently outperforms both standard Mamba and Transformer baselines on ASR and speech enhancement benchmarks, with particularly pronounced gains in semantically sensitive scenarios. The results demonstrate BiMamba's efficiency, generalization capability, and viability as a scalable, attention-free paradigm for speech modeling.
📝 Abstract
Transformer and its derivatives have achieved success in diverse tasks across computer vision, natural language processing, and speech processing. To reduce the computational complexity of the multi-head self-attention mechanism in the Transformer, selective state space models (i.e., Mamba) were proposed as an alternative. Mamba has demonstrated its effectiveness in natural language processing and computer vision tasks, but its suitability for speech signal processing has rarely been investigated. This paper explores solutions for applying Mamba to speech processing through two typical speech processing tasks: speech recognition, which requires semantic and sequential information, and speech enhancement, which focuses primarily on sequential patterns. The experimental results confirm that bidirectional Mamba (BiMamba) consistently outperforms vanilla Mamba, highlighting the advantages of a bidirectional design for speech processing. Moreover, the experiments demonstrate the effectiveness of BiMamba as an alternative to the self-attention module in the Transformer model and its derivatives, particularly for the semantics-aware task. The key techniques for transferring Mamba to speech are then summarized in the ablation studies and the discussion section, offering insights for extending this research to a broader range of tasks.
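The core idea behind BiMamba, running a state space scan over the sequence in both time directions and merging the results so every frame sees past and future context, can be illustrated with a minimal sketch. Note the assumptions: the recurrence below is a plain time-invariant linear SSM, a simplified stand-in for Mamba's selective (input-dependent) scan, and the function names (`ssm_scan`, `bimamba_block`) and the concatenation-based merge are illustrative choices, not the paper's exact architecture.

```python
import numpy as np

def ssm_scan(x, A, B, C):
    """Toy unidirectional SSM: h_t = A h_{t-1} + B x_t, y_t = C h_t.
    A stand-in for a Mamba block; real Mamba uses a selective,
    input-dependent scan plus gating and convolution."""
    T = x.shape[0]
    h = np.zeros(A.shape[0])
    ys = []
    for t in range(T):
        h = A @ h + B @ x[t]   # causal recurrence: h_t depends only on x[0..t]
        ys.append(C @ h)
    return np.stack(ys)

def bimamba_block(x, params_fwd, params_bwd):
    """Bidirectional wrapper: scan the sequence forward, scan the
    time-reversed sequence with a second set of parameters, re-reverse
    the backward outputs, and concatenate per frame."""
    y_fwd = ssm_scan(x, *params_fwd)
    y_bwd = ssm_scan(x[::-1], *params_bwd)[::-1]
    return np.concatenate([y_fwd, y_bwd], axis=-1)
```

The forward half of each output frame remains strictly causal, while the backward half injects future context, which is what lets BiMamba serve as a drop-in replacement for (non-causal) self-attention in offline ASR and speech enhancement.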