🤖 AI Summary
Standard Transformers in audio self-supervised representation learning often allocate attention to irrelevant spectral regions, diminishing discriminability. To address this, we propose Differential Attention, a mechanism that explicitly suppresses spurious attention responses via dual-softmax normalization and learnable differential coefficients, thereby sharpening focus on discriminative time-frequency patterns. Integrated into a spectrogram-input Transformer architecture, it is fully compatible with mainstream self-supervised training paradigms. Evaluated on three benchmarks (AS-2M, SPC-2, and ESC-50), our method achieves state-of-the-art performance: 49.0% mAP, 98.3% accuracy, and 96.1% accuracy, respectively. These results demonstrate significant gains in representation discriminability and strong generalization across diverse audio classification tasks.
📝 Abstract
In recent audio self-supervised representation learning, the standard Transformer architecture has become the predominant choice, yet its attention mechanism often allocates a portion of the attention weights to irrelevant information, which can impair the model's discriminative ability. To address this, we introduce a differential attention mechanism that mitigates ineffective attention allocation by combining dual-softmax operations with appropriately tuned differential coefficients. Experimental results demonstrate that our ASDA model achieves state-of-the-art (SOTA) performance across multiple benchmarks, including audio classification (49.0% mAP on AS-2M, 41.5% mAP on AS20K), keyword spotting (98.3% accuracy on SPC-2), and environmental sound classification (96.1% accuracy on ESC-50). These results highlight ASDA's effectiveness in audio tasks, paving the way for broader applications.
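The dual-softmax idea can be illustrated with a minimal sketch: two attention maps are computed from separate query/key projections and subtracted, scaled by a differential coefficient, so that common-mode (spurious) attention weight cancels out. This is a hedged, single-head NumPy illustration of the general differential-attention form; the function names, weight shapes, and the fixed coefficient `lam` here are illustrative assumptions, not the ASDA paper's exact implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def differential_attention(x, Wq1, Wk1, Wq2, Wk2, Wv, lam):
    """Sketch of differential attention (names hypothetical):
    two softmax attention maps are subtracted, weighted by the
    differential coefficient `lam`, suppressing attention mass
    that both maps assign to irrelevant positions."""
    d = Wq1.shape[1]
    q1, k1 = x @ Wq1, x @ Wk1
    q2, k2 = x @ Wq2, x @ Wk2
    v = x @ Wv
    a1 = softmax(q1 @ k1.T / np.sqrt(d))
    a2 = softmax(q2 @ k2.T / np.sqrt(d))
    attn = a1 - lam * a2      # differential map; each row sums to 1 - lam
    return attn @ v, attn

# Toy usage: 4 spectrogram-patch tokens of dimension 8
rng = np.random.default_rng(0)
T, D = 4, 8
x = rng.standard_normal((T, D))
Ws = [rng.standard_normal((D, D)) * 0.1 for _ in range(5)]
out, attn = differential_attention(x, *Ws, lam=0.5)
print(out.shape)         # (4, 8)
print(attn.sum(axis=1))  # each row sums to 1 - lam = 0.5
```

In a full model `lam` would be a learnable parameter (per head or per layer) rather than a constant, and the two maps would typically be produced by splitting the head dimension of a standard multi-head projection.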