🤖 AI Summary
This work investigates the feasibility of extended LSTM (xLSTM) for self-supervised general-purpose audio representation learning. We propose Audio xLSTM (AxLSTM), the first adaptation of xLSTM to this task, leveraging spectrogram patch masking for sequential audio representation learning. AxLSTM preserves strong long-range dependency modeling while substantially improving generalization and parameter efficiency. Pretrained on AudioSet, it outperforms comparable SSAST baselines by up to 20% in relative performance across ten downstream audio understanding tasks, with up to 45% fewer parameters. Our key contributions are: (1) pioneering the application of xLSTM to self-supervised audio representation learning; and (2) empirically demonstrating that xLSTM maintains strong sequential modeling capabilities while achieving enhanced transferability to downstream tasks and a more favorable parameter–performance trade-off compared to state-of-the-art alternatives.
📝 Abstract
While the transformer has emerged as the preeminent neural architecture, several independent lines of research seek to address its limitations. Recurrent neural approaches have also seen renewed interest, including the extended long short-term memory (xLSTM) architecture, which reinvigorates the original LSTM. However, while xLSTMs have shown performance competitive with the transformer, their viability for learning self-supervised general-purpose audio representations has not yet been evaluated. This work proposes Audio xLSTM (AxLSTM), an approach to learning audio representations from masked spectrogram patches in a self-supervised setting. Pretrained on the AudioSet dataset, the proposed AxLSTM models outperform comparable self-supervised audio spectrogram transformer (SSAST) baselines by up to 20% in relative performance across a set of ten diverse downstream tasks while having up to 45% fewer parameters.
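The masked-spectrogram-patch setup described in the abstract can be sketched roughly as follows: a time–frequency spectrogram is split into non-overlapping patches, a random subset is masked, and the sequence model is trained to reconstruct the masked patches. This is a minimal illustrative sketch, not the paper's implementation; the patch size, mask ratio, and zero-fill masking (a learned mask token is typical in practice) are assumptions.

```python
import numpy as np

def patchify(spec: np.ndarray, patch: int = 16) -> np.ndarray:
    """Split a (freq, time) spectrogram into non-overlapping patch x patch
    tiles, flattened into a sequence of patch vectors."""
    f, t = spec.shape
    f, t = f - f % patch, t - t % patch  # crop to a multiple of the patch size
    tiles = spec[:f, :t].reshape(f // patch, patch, t // patch, patch)
    return tiles.transpose(0, 2, 1, 3).reshape(-1, patch * patch)

def mask_patches(patches: np.ndarray, ratio: float = 0.5, seed: int = 0):
    """Zero out a random subset of patches; the pretraining objective is to
    reconstruct the originals at the masked positions."""
    rng = np.random.default_rng(seed)
    n = patches.shape[0]
    idx = rng.choice(n, size=int(n * ratio), replace=False)
    masked = patches.copy()
    masked[idx] = 0.0  # illustrative: real systems often use a learned mask token
    return masked, idx

# Example: 128 mel bins x 256 frames -> 8 x 16 = 128 patches of 256 dims each
spec = np.random.rand(128, 256).astype(np.float32)
patches = patchify(spec)
masked, idx = mask_patches(patches, ratio=0.5)
```

The masked sequence is what the backbone (xLSTM blocks here, transformer blocks in SSAST) consumes; only the backbone differs between the two approaches.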