An overview of neural architectures for self-supervised audio representation learning from masked spectrograms

📅 2025-09-23
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Problem: A systematic survey bridging masked spectrogram modeling and emerging sequence models (such as Mamba and xLSTM) is currently lacking, hindering principled architecture selection in self-supervised audio representation learning.
Method: This work presents the first unified, reproducible benchmark comparing Transformers, selective structured state-space models (Mamba), and extended long short-term memory networks (xLSTM) on masked spectrogram reconstruction. Experiments span ten diverse audio classification downstream tasks to comprehensively evaluate representational capacity, computational efficiency, and generalization.
Contribution/Results: The study identifies distinct architectural trade-offs and applicability boundaries for audio modeling, filling a critical gap in both the theoretical survey and empirical validation of advanced sequence models for self-supervised audio foundation models. It provides empirically grounded guidance and methodological support for designing general-purpose audio foundation models and selecting appropriate architectures.

📝 Abstract
In recent years, self-supervised learning has attracted significant interest for training deep neural representations without labeled data. One such approach is masked spectrogram modeling, where the objective is to learn semantically rich contextual representations by predicting removed or hidden portions of the input audio spectrogram. With the Transformer architecture at its core, masked spectrogram modeling has emerged as the prominent approach for learning general-purpose audio representations, a.k.a. audio foundation models. Meanwhile, efforts to address the limitations of the Transformer architecture, in particular the underlying scaled dot-product attention operation, whose cost scales quadratically with input sequence length, have led to renewed interest in recurrent sequence modeling. Among these, selective structured state-space models (such as Mamba) and the extended Long Short-Term Memory (xLSTM) are the two most promising approaches and have seen widespread adoption. While the body of work on these two topics continues to grow, an adequate overview of their intersection is currently lacking. In this paper, we present a comprehensive overview of the aforementioned research domains, covering masked spectrogram modeling and the neural sequence modeling architectures Mamba and xLSTM. Further, we compare Transformer-, Mamba-, and xLSTM-based masked spectrogram models in a unified, reproducible framework on ten diverse downstream audio classification tasks, helping interested readers make informed decisions about the suitability of each approach for adjacent applications.
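The masking objective described in the abstract can be made concrete with a short sketch. This is a minimal illustration, not the paper's exact recipe: the patch size (16×16), mask ratio (0.75), and zero-filling of hidden patches are assumptions chosen for clarity; the pretraining model would then be trained to reconstruct the hidden patches from the visible ones.

```python
import numpy as np

def mask_spectrogram(spec, patch=(16, 16), mask_ratio=0.75, rng=None):
    """Randomly hide non-overlapping time-frequency patches of a
    (freq, time) spectrogram.

    Returns the masked spectrogram and a boolean patch-grid mask
    (True = hidden, i.e. a reconstruction target).
    """
    rng = rng or np.random.default_rng(0)
    F, T = spec.shape
    pf, pt = patch
    nf, nt = F // pf, T // pt                   # patch grid dimensions
    n_patches = nf * nt
    n_mask = int(round(mask_ratio * n_patches))
    chosen = rng.permutation(n_patches)[:n_mask]  # patches to hide
    mask = np.zeros(n_patches, dtype=bool)
    mask[chosen] = True
    masked = spec.copy()
    for i in np.flatnonzero(mask):
        f0, t0 = (i // nt) * pf, (i % nt) * pt
        masked[f0:f0 + pf, t0:t0 + pt] = 0.0    # placeholder mask value
    return masked, mask.reshape(nf, nt)

# e.g. 128 mel bins x 256 time frames
spec = np.random.default_rng(1).random((128, 256)).astype(np.float32)
masked, mask = mask_spectrogram(spec)
```

With these hypothetical settings, 96 of the 128 patches are hidden; the loss (typically mean squared error) is computed only over the hidden patches.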
Problem

Research questions and friction points this paper is trying to address.

Lack of a comprehensive overview combining masked spectrogram modeling with modern sequence architectures
Addressing the Transformer's quadratic attention scaling in audio representation learning
Comparing Transformer-, Mamba-, and xLSTM-based models across diverse audio classification tasks
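The scaling contrast behind these points can be shown with a back-of-the-envelope operation count. This is an idealized sketch, ignoring constant factors and hardware effects; the model width (768) and state size (16) are illustrative assumptions, not values from the paper:

```python
def attention_ops(L, d):
    """Approximate multiply-adds for scaled dot-product attention:
    Q @ K^T costs L*L*d, and the attention-weighted sum of V costs
    another L*L*d, so the total grows quadratically in L."""
    return 2 * L * L * d

def ssm_scan_ops(L, d, n_state=16):
    """Approximate multiply-adds for a recurrent/state-space scan:
    each of the L steps updates a d x n_state hidden state, so the
    total grows linearly in L."""
    return 2 * L * d * n_state

d = 768  # assumed model width
for L in (512, 2048, 8192):
    ratio = attention_ops(L, d) / ssm_scan_ops(L, d)
    print(f"L={L:5d}: attention/scan ops ratio = {ratio:.0f}x")
```

Under these assumptions the ratio is simply L / n_state: doubling the sequence length quadruples attention cost but only doubles the scan cost, which is the motivation for the Mamba and xLSTM alternatives compared in the paper.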
Innovation

Methods, ideas, or system contributions that make the work stand out.

Masked spectrogram modeling for self-supervised audio learning
Transformer architecture with quadratic attention scaling issue
Mamba and xLSTM as recurrent sequence modeling alternatives