SSAMBA: Self-Supervised Audio Representation Learning With Mamba State Space Model

📅 2024-05-20
🏛️ Spoken Language Technology Workshop
📈 Citations: 12
Influential: 2
🤖 AI Summary
To address the O(L²) computational and memory cost that self-attention imposes on Transformer-based audio models, this paper proposes SSAMBA, the first self-supervised, attention-free, SSM-based model for audio representation learning. Methodologically, SSAMBA introduces three key components: (1) an adaptation of the Mamba state space model to audio representation learning; (2) a bidirectional Mamba structure designed to capture long-range temporal-spectral dependencies; and (3) a self-supervised pretraining framework that jointly optimizes discriminative (contrastive) and generative (masked spectrogram reconstruction) objectives. Evaluated on audio classification, keyword spotting, speaker identification, and emotion recognition, SSAMBA outperforms SSAST on most tasks. Its Tiny variant is roughly 92.7% faster in batch inference and uses 95.4% less GPU memory than SSAST at an input size of 22k tokens, making it well suited to resource-constrained deployment.
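The joint pretraining objective described above combines a generative reconstruction term with a discriminative matching term over masked spectrogram patches. The following is a minimal numpy sketch of that idea, not the paper's implementation: the InfoNCE-style formulation of the discriminative term and the balancing `weight` are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def joint_pretraining_loss(pred, target, weight=10.0):
    """Sketch of a joint discriminative + generative objective over masked
    patches. pred, target: (N, D) arrays of predicted and true patch
    embeddings for N masked patches. `weight` is an illustrative choice,
    not the paper's value."""
    # Generative term: mean squared reconstruction error.
    mse = np.mean((pred - target) ** 2)
    # Discriminative term: each prediction should match its own target more
    # closely than the other N-1 targets (softmax over dot-product similarities).
    logits = pred @ target.T                       # (N, N) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)    # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    nce = -np.mean(np.diag(log_probs))
    return nce + weight * mse

pred = rng.normal(size=(8, 16))
loss = joint_pretraining_loss(pred, pred)  # perfect prediction: mse term is zero
```

With perfect predictions the generative term vanishes and only the (always non-negative) contrastive term remains; corrupting the predictions increases both terms.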

📝 Abstract
Transformers have revolutionized deep learning across various tasks, including audio representation learning, due to their powerful modeling capabilities. However, they often suffer from quadratic complexity in both GPU memory usage and computational inference time, affecting their efficiency. Recently, state space models (SSMs) like Mamba have emerged as a promising alternative, offering a more efficient approach by avoiding these complexities. Given these advantages, we explore the potential of SSM-based models in audio tasks. In this paper, we introduce Self-Supervised Audio Mamba (SSAMBA), the first self-supervised, attention-free, and SSM-based model for audio representation learning. SSAMBA leverages the bidirectional Mamba to capture complex audio patterns effectively. We incorporate a self-supervised pretraining framework that optimizes both discriminative and generative objectives, enabling the model to learn robust audio representations from large-scale, unlabeled datasets. We evaluated SSAMBA on various tasks such as audio classification, keyword spotting, speaker identification, and emotion recognition. Our results demonstrate that SSAMBA outperforms the Self-Supervised Audio Spectrogram Transformer (SSAST) in most tasks. Notably, SSAMBA is approximately 92.7% faster in batch inference speed and 95.4% more memory-efficient than SSAST for the tiny model size with an input token size of 22k. These efficiency gains, combined with superior performance, underscore the effectiveness of SSAMBA’s architectural innovation, making it a compelling choice for a wide range of audio processing applications. Code at https://github.com/SiavashShams/ssamba.
Problem

Research questions and friction points this paper is trying to address.

Efficient audio representation learning
Self-supervised attention-free model
Mamba state space model advantages
Innovation

Methods, ideas, or system contributions that make the work stand out.

Mamba State Space Model
Self-Supervised Pretraining Framework
Bidirectional Audio Pattern Capture
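The bidirectional design listed above can be illustrated with a toy sketch: run a causal state-space scan over the patch sequence in both directions and combine the results, so each position sees past and future context. This is a deliberately simplified linear recurrence, not Mamba's selective SSM; the constants `a`, `b` and the sum combination are illustrative assumptions.

```python
import numpy as np

def ssm_scan(x, a=0.9, b=0.5):
    """Minimal linear state-space recurrence h_t = a*h_{t-1} + b*x_t.
    A stand-in for a real (selective) Mamba block, for illustration only."""
    h = np.zeros(x.shape[1])
    out = np.empty_like(x)
    for t in range(x.shape[0]):
        h = a * h + b * x[t]
        out[t] = h
    return out

def bidirectional_block(x):
    """Scan the patch sequence forward and backward, then sum the two
    outputs so every position carries context from both directions."""
    fwd = ssm_scan(x)
    bwd = ssm_scan(x[::-1])[::-1]
    return fwd + bwd

x = np.zeros((6, 4))
x[-1] = 1.0            # energy only at the final frame
y = bidirectional_block(x)
# y[0] is nonzero: the backward scan propagates future context to position 0
```

A forward-only scan leaves position 0 untouched by the final frame; the backward pass is what closes that gap.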