🤖 AI Summary
Standard Transformers in audio self-supervised representation learning often allocate attention to irrelevant spectral regions, diminishing discriminability. To address this, we propose Differential Attention, a mechanism that explicitly suppresses spurious attention responses via dual-softmax normalization and learnable differential coefficients, thereby sharpening focus on discriminative time-frequency patterns. Integrated into a spectrogram-input Transformer architecture, it is fully compatible with mainstream self-supervised training paradigms. Evaluated on three benchmarks (AS-2M, SPC-2, and ESC-50), our method achieves state-of-the-art performance: 49.0% mAP, 98.3% accuracy, and 96.1% accuracy, respectively. These results demonstrate significant gains in representation discriminability and strong generalization across diverse audio classification tasks.
📝 Abstract
In recent audio self-supervised representation learning, the standard Transformer architecture has become the predominant choice, yet its attention mechanism often allocates a portion of the attention weights to irrelevant information, which can impair the model's discriminative ability. To address this, we introduce a differential attention mechanism that mitigates ineffective attention allocation by combining dual-softmax operations with appropriately tuned differential coefficients. Experimental results demonstrate that our ASDA model achieves state-of-the-art (SOTA) performance across multiple benchmarks, including audio classification (49.0% mAP on AS-2M, 41.5% mAP on AS20K), keyword spotting (98.3% accuracy on SPC-2), and environmental sound classification (96.1% accuracy on ESC-50). These results highlight ASDA's effectiveness in audio tasks, paving the way for broader applications.
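The dual-softmax idea can be illustrated with a minimal sketch: two attention maps are computed from separate query/key projections and subtracted, scaled by a differential coefficient, so that common-mode (spurious) attention weight cancels out. This is a hedged, single-head NumPy illustration of the general differential-attention form; the function names, weight shapes, and the fixed coefficient `lam` here are illustrative assumptions, not the ASDA paper's exact implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def differential_attention(x, Wq1, Wk1, Wq2, Wk2, Wv, lam):
    """Sketch of differential attention (names hypothetical):
    two softmax attention maps are subtracted, weighted by the
    differential coefficient `lam`, suppressing attention mass
    that both maps assign to irrelevant positions."""
    d = Wq1.shape[1]
    q1, k1 = x @ Wq1, x @ Wk1
    q2, k2 = x @ Wq2, x @ Wk2
    v = x @ Wv
    a1 = softmax(q1 @ k1.T / np.sqrt(d))
    a2 = softmax(q2 @ k2.T / np.sqrt(d))
    attn = a1 - lam * a2      # differential map; each row sums to 1 - lam
    return attn @ v, attn

# Toy usage: 4 spectrogram-patch tokens of dimension 8
rng = np.random.default_rng(0)
T, D = 4, 8
x = rng.standard_normal((T, D))
Ws = [rng.standard_normal((D, D)) * 0.1 for _ in range(5)]
out, attn = differential_attention(x, *Ws, lam=0.5)
print(out.shape)         # (4, 8)
print(attn.sum(axis=1))  # each row sums to 1 - lam = 0.5
```

In a full model `lam` would be a learnable parameter (per head or per layer) rather than a constant, and the two maps would typically be produced by splitting the head dimension of a standard multi-head projection.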