🤖 AI Summary
This work investigates the feasibility of extended LSTM (xLSTM) for self-supervised general-purpose audio representation learning. We propose Audio xLSTM (AxLSTM), the first adaptation of xLSTM to this task, leveraging spectrogram patch masking for sequential audio representation learning. AxLSTM preserves strong long-range dependency modeling while substantially improving generalization and parameter efficiency. Pretrained on AudioSet, it outperforms comparable SSAST baselines by up to 20% in relative performance across ten downstream audio understanding tasks, with up to 45% fewer parameters. Our key contributions are: (1) pioneering the application of xLSTM to self-supervised audio representation learning; and (2) empirically demonstrating that xLSTM maintains strong sequential modeling capabilities while achieving enhanced transferability to downstream tasks and a more favorable parameter–performance trade-off compared to state-of-the-art alternatives.
📝 Abstract
While the transformer has emerged as the preeminent neural architecture, several independent lines of research seek to address its limitations. Recurrent neural approaches have also seen renewed interest, including the extended long short-term memory (xLSTM) architecture, which reinvigorates the original LSTM. However, while xLSTMs have shown performance competitive with the transformer, their viability for learning self-supervised general-purpose audio representations has not yet been evaluated. This work proposes Audio xLSTM (AxLSTM), an approach to learning audio representations from masked spectrogram patches in a self-supervised setting. Pretrained on the AudioSet dataset, the proposed AxLSTM models outperform comparable self-supervised audio spectrogram transformer (SSAST) baselines by up to 20% in relative performance across a set of ten diverse downstream tasks while having up to 45% fewer parameters.
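The masked-spectrogram-patch setup described in the abstract can be sketched roughly as follows: a time–frequency spectrogram is split into non-overlapping patches, a random subset is masked, and the sequence model is trained to reconstruct the masked patches. This is a minimal illustrative sketch, not the paper's implementation; the patch size, mask ratio, and zero-fill masking (a learned mask token is typical in practice) are assumptions.

```python
import numpy as np

def patchify(spec: np.ndarray, patch: int = 16) -> np.ndarray:
    """Split a (freq, time) spectrogram into non-overlapping patch x patch
    tiles, flattened into a sequence of patch vectors."""
    f, t = spec.shape
    f, t = f - f % patch, t - t % patch  # crop to a multiple of the patch size
    tiles = spec[:f, :t].reshape(f // patch, patch, t // patch, patch)
    return tiles.transpose(0, 2, 1, 3).reshape(-1, patch * patch)

def mask_patches(patches: np.ndarray, ratio: float = 0.5, seed: int = 0):
    """Zero out a random subset of patches; the pretraining objective is to
    reconstruct the originals at the masked positions."""
    rng = np.random.default_rng(seed)
    n = patches.shape[0]
    idx = rng.choice(n, size=int(n * ratio), replace=False)
    masked = patches.copy()
    masked[idx] = 0.0  # illustrative: real systems often use a learned mask token
    return masked, idx

# Example: 128 mel bins x 256 frames -> 8 x 16 = 128 patches of 256 dims each
spec = np.random.rand(128, 256).astype(np.float32)
patches = patchify(spec)
masked, idx = mask_patches(patches, ratio=0.5)
```

The masked sequence is what the backbone (xLSTM blocks here, transformer blocks in SSAST) consumes; only the backbone differs between the two approaches.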