Audio xLSTMs: Learning Self-supervised audio representations with xLSTMs

📅 2024-08-29
🏛️ arXiv.org
📈 Citations: 2
Influential: 0
📄 PDF
🤖 AI Summary
This work investigates the feasibility of extended LSTM (xLSTM) for self-supervised general-purpose audio representation learning. We propose Audio xLSTM (AxLSTM), the first adaptation of xLSTM to this task, leveraging spectrogram patch masking for sequential audio representation learning. AxLSTM preserves strong long-range dependency modeling while substantially improving generalization and parameter efficiency. Pretrained on AudioSet, it outperforms the SSAST baseline by up to 20% in relative performance across ten downstream audio understanding tasks, with up to 45% fewer parameters. Our key contributions are: (1) pioneering the application of xLSTM to self-supervised audio representation learning; and (2) empirically demonstrating that xLSTM maintains strong sequential modeling capabilities while achieving improved transferability to downstream tasks and a more favorable parameter–performance trade-off compared to state-of-the-art alternatives.

📝 Abstract
While the transformer has emerged as the eminent neural architecture, several independent lines of research have emerged to address its limitations. Recurrent neural approaches have also seen a great deal of renewed interest, including the extended long short-term memory (xLSTM) architecture, which reinvigorates the original LSTM architecture. However, while xLSTMs have shown competitive performance compared to the transformer, their viability for learning self-supervised general-purpose audio representations has not yet been evaluated. This work proposes Audio xLSTM (AxLSTM), an approach to learn audio representations from masked spectrogram patches in a self-supervised setting. Pretrained on the AudioSet dataset, the proposed AxLSTM models outperform comparable self-supervised audio spectrogram transformer (SSAST) baselines by up to 20% in relative performance across a set of ten diverse downstream tasks while having up to 45% fewer parameters.
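The masked-spectrogram-patch pretext task described in the abstract can be illustrated with a minimal NumPy sketch. This is an assumption-laden toy, not the paper's implementation: the patch size (16×16), mask ratio (50%), and function names are illustrative choices, and zeroing masked patches stands in for whatever mask embedding the actual model uses.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy log-mel spectrogram: 80 mel bins x 400 time frames (random placeholder values).
spec = rng.standard_normal((80, 400)).astype(np.float32)

def patchify(x, ph=16, pw=16):
    """Split a (freq, time) spectrogram into non-overlapping ph x pw patches."""
    F, T = x.shape
    F, T = F - F % ph, T - T % pw                # drop any remainder rows/cols
    x = x[:F, :T].reshape(F // ph, ph, T // pw, pw)
    # -> (num_patches, patch_dim), one flattened vector per patch
    return x.transpose(0, 2, 1, 3).reshape(-1, ph * pw)

def mask_patches(patches, mask_ratio=0.5, rng=rng):
    """Zero out a random subset of patches; the model would reconstruct these."""
    n = patches.shape[0]
    idx = rng.permutation(n)[: int(n * mask_ratio)]
    masked = patches.copy()
    masked[idx] = 0.0                            # masked patches become targets
    return masked, idx

patches = patchify(spec)          # (5 * 25, 256) = (125, 256)
masked, idx = mask_patches(patches)
```

In the self-supervised setup, the sequence of (partially masked) patch embeddings would be fed to the xLSTM encoder, which is trained to predict the contents of the masked patches.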
Problem

Research questions and friction points this paper is trying to address.

Evaluating xLSTM for self-supervised audio representation learning
Proposing Audio xLSTM for masked spectrogram patch learning
Comparing AxLSTM performance with SSAST baselines
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses xLSTM for audio representation learning
Learns from masked spectrogram patches self-supervised
Outperforms SSAST with fewer parameters
Sarthak Yadav
Department of Electronic Systems, Aalborg University, Aalborg, Denmark; Pioneer Centre for Artificial Intelligence, Denmark
S. Theodoridis
Department of Electronic Systems, Aalborg University, Aalborg, Denmark; National and Kapodistrian University of Athens, Athens, Greece
Zheng-Hua Tan
Professor of Machine Learning and Speech Processing, Aalborg University and Pioneer Centre for AI
Machine learning, deep learning, self-supervised learning, speech processing, multimodal.