SSAMBA: Self-Supervised Audio Representation Learning With Mamba State Space Model

📅 2024-05-20
🏛️ Spoken Language Technology Workshop
📈 Citations: 12
Influential: 2
🤖 AI Summary
To address the O(L²) computational and memory cost that self-attention imposes on Transformer-based audio models, this paper proposes SSAMBA, the first self-supervised, attention-free, SSM-based model for audio representation learning. Methodologically, SSAMBA introduces three key components: (1) an adaptation of the Mamba state space model to audio representation learning; (2) a bidirectional Mamba structure designed to capture long-range temporal-spectral dependencies; and (3) a self-supervised pretraining framework that jointly optimizes discriminative (contrastive) and generative (masked spectrogram reconstruction) objectives. Evaluated on audio classification, keyword spotting, speaker identification, and emotion recognition, SSAMBA outperforms SSAST on most tasks. Its Tiny variant is roughly 92.7% faster in batch inference and uses 95.4% less GPU memory than SSAST at an input size of 22k tokens, making it well suited to resource-constrained deployment.
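The joint pretraining objective described above combines a generative reconstruction term with a discriminative matching term over masked spectrogram patches. The following is a minimal numpy sketch of that idea, not the paper's implementation: the InfoNCE-style formulation of the discriminative term and the balancing `weight` are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def joint_pretraining_loss(pred, target, weight=10.0):
    """Sketch of a joint discriminative + generative objective over masked
    patches. pred, target: (N, D) arrays of predicted and true patch
    embeddings for N masked patches. `weight` is an illustrative choice,
    not the paper's value."""
    # Generative term: mean squared reconstruction error.
    mse = np.mean((pred - target) ** 2)
    # Discriminative term: each prediction should match its own target more
    # closely than the other N-1 targets (softmax over dot-product similarities).
    logits = pred @ target.T                       # (N, N) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)    # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    nce = -np.mean(np.diag(log_probs))
    return nce + weight * mse

pred = rng.normal(size=(8, 16))
loss = joint_pretraining_loss(pred, pred)  # perfect prediction: mse term is zero
```

With perfect predictions the generative term vanishes and only the (always non-negative) contrastive term remains; corrupting the predictions increases both terms.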

📝 Abstract
Transformers have revolutionized deep learning across various tasks, including audio representation learning, due to their powerful modeling capabilities. However, they often suffer from quadratic complexity in both GPU memory usage and computational inference time, affecting their efficiency. Recently, state space models (SSMs) like Mamba have emerged as a promising alternative, offering a more efficient approach by avoiding these complexities. Given these advantages, we explore the potential of SSM-based models in audio tasks. In this paper, we introduce Self-Supervised Audio Mamba (SSAMBA), the first self-supervised, attention-free, and SSM-based model for audio representation learning. SSAMBA leverages the bidirectional Mamba to capture complex audio patterns effectively. We incorporate a self-supervised pretraining framework that optimizes both discriminative and generative objectives, enabling the model to learn robust audio representations from large-scale, unlabeled datasets. We evaluated SSAMBA on various tasks such as audio classification, keyword spotting, speaker identification, and emotion recognition. Our results demonstrate that SSAMBA outperforms the Self-Supervised Audio Spectrogram Transformer (SSAST) in most tasks. Notably, SSAMBA is approximately 92.7% faster in batch inference speed and 95.4% more memory-efficient than SSAST for the tiny model size with an input token size of 22k. These efficiency gains, combined with superior performance, underscore the effectiveness of SSAMBA’s architectural innovation, making it a compelling choice for a wide range of audio processing applications. Code at https://github.com/SiavashShams/ssamba.
Problem

Research questions and friction points this paper is trying to address.

Efficient audio representation learning
Self-supervised attention-free model
Mamba state space model advantages
Innovation

Methods, ideas, or system contributions that make the work stand out.

Mamba State Space Model
Self-Supervised Pretraining Framework
Bidirectional Audio Pattern Capture
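The bidirectional design listed above can be illustrated with a toy sketch: run a causal state-space scan over the patch sequence in both directions and combine the results, so each position sees past and future context. This is a deliberately simplified linear recurrence, not Mamba's selective SSM; the constants `a`, `b` and the sum combination are illustrative assumptions.

```python
import numpy as np

def ssm_scan(x, a=0.9, b=0.5):
    """Minimal linear state-space recurrence h_t = a*h_{t-1} + b*x_t.
    A stand-in for a real (selective) Mamba block, for illustration only."""
    h = np.zeros(x.shape[1])
    out = np.empty_like(x)
    for t in range(x.shape[0]):
        h = a * h + b * x[t]
        out[t] = h
    return out

def bidirectional_block(x):
    """Scan the patch sequence forward and backward, then sum the two
    outputs so every position carries context from both directions."""
    fwd = ssm_scan(x)
    bwd = ssm_scan(x[::-1])[::-1]
    return fwd + bwd

x = np.zeros((6, 4))
x[-1] = 1.0            # energy only at the final frame
y = bidirectional_block(x)
# y[0] is nonzero: the backward scan propagates future context to position 0
```

A forward-only scan leaves position 0 untouched by the final frame; the backward pass is what closes that gap.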