ASDA: Audio Spectrogram Differential Attention Mechanism for Self-Supervised Representation Learning

📅 2025-07-03
🤖 AI Summary
Standard Transformers in audio self-supervised representation learning often allocate attention to irrelevant spectral regions, diminishing discriminability. To address this, we propose differential attention, a mechanism that explicitly suppresses spurious attention responses via dual-softmax normalization and learnable differential coefficients, thereby sharpening selective focus on discriminative time-frequency patterns. Integrated into a spectrogram-input Transformer architecture, it is fully compatible with mainstream self-supervised training paradigms. Evaluated on three benchmarks (AS-2M, SPC-2, and ESC-50), our method achieves state-of-the-art performance: 49.0% mAP, 98.3% accuracy, and 96.1% accuracy, respectively. These results demonstrate significant improvements in representation discriminability and strong generalization across diverse audio classification tasks.
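The core idea of dual-softmax differential attention can be sketched in code. The summary does not give the paper's exact formulation, so the following is a minimal single-head NumPy sketch under the usual differential-attention reading: a second softmax attention map, scaled by a learnable coefficient `lam`, is subtracted from the first to cancel attention mass on irrelevant positions. All function and weight names here are illustrative, not the authors' implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def differential_attention(x, Wq1, Wk1, Wq2, Wk2, Wv, lam):
    """Single-head differential attention over a (T, d) spectrogram-patch sequence.

    Two attention maps come from separate query/key projections; the second,
    weighted by the learnable differential coefficient `lam`, is subtracted
    from the first so that attention shared by both maps (noise) cancels out.
    """
    d = Wk1.shape[1]
    a1 = softmax((x @ Wq1) @ (x @ Wk1).T / np.sqrt(d))  # primary attention map
    a2 = softmax((x @ Wq2) @ (x @ Wk2).T / np.sqrt(d))  # "noise" attention map
    return (a1 - lam * a2) @ (x @ Wv)                    # differential weighting of values
```

With `lam = 0` this reduces to standard scaled dot-product attention; the differential coefficient controls how aggressively common-mode attention is suppressed.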

📝 Abstract
In recent advancements in audio self-supervised representation learning, the standard Transformer architecture has emerged as the predominant approach, yet its attention mechanism often allocates a portion of attention weights to irrelevant information, potentially impairing the model's discriminative ability. To address this, we introduce a differential attention mechanism, which effectively mitigates ineffective attention allocation through the integration of dual-softmax operations and appropriately tuned differential coefficients. Experimental results demonstrate that our ASDA model achieves state-of-the-art (SOTA) performance across multiple benchmarks, including audio classification (49.0% mAP on AS-2M, 41.5% mAP on AS20K), keyword spotting (98.3% accuracy on SPC-2), and environmental sound classification (96.1% accuracy on ESC-50). These results highlight ASDA's effectiveness in audio tasks, paving the way for broader applications.
Problem

Research questions and friction points this paper is trying to address.

Standard Transformer attention assigns weight to irrelevant spectral information, impairing the model's discriminative ability
How to suppress ineffective attention allocation while remaining compatible with mainstream self-supervised training paradigms
Whether sharper attention selectivity generalizes across audio classification, keyword spotting, and environmental sound tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Differential attention mechanism for audio self-supervised learning, integrated into a spectrogram-input Transformer
Dual-softmax operations with tuned differential coefficients cancel irrelevant attention weights
State-of-the-art results on AS-2M, AS20K, SPC-2, and ESC-50