🤖 AI Summary
Existing audio classification models (e.g., AST, Audio Mamba) adopt square spectrogram patching inherited from vision, which disrupts frequency continuity and produces redundant patches, leading to high computational overhead and inefficient training. To address this, we propose Full-Frequency Temporal Patching (FFTP) and Structured Spectrogram Masking (SpecMask). FFTP partitions the spectrogram along the time axis into segments that each span the full frequency range, preserving harmonic structure while drastically reducing the number of patches. SpecMask applies structured masking in the time–frequency domain under a fixed budget, enhancing temporal robustness while preserving spectral continuity. Both methods are architecture-agnostic and integrate seamlessly into Transformer and State Space Model (SSM) backbones. On AudioSet-18k, our approach improves mean average precision (mAP) by up to 6.76 points; on SpeechCommandsV2, it raises accuracy by up to 8.46 percentage points; and it cuts computation by up to 83.26%. These results demonstrate joint gains in both performance and efficiency.
📝 Abstract
Transformers and State-Space Models (SSMs) have advanced audio classification by modeling spectrograms as sequences of patches. However, existing models such as the Audio Spectrogram Transformer (AST) and Audio Mamba (AuM) adopt square patching from computer vision, which disrupts continuous frequency patterns and produces an excessive number of patches, slowing training and increasing computation. We propose Full-Frequency Temporal Patching (FFTP), a patching strategy that better matches the time-frequency asymmetry of spectrograms: each patch spans the full frequency band over a localized temporal window, preserving harmonic structure while significantly reducing patch count and computation. We also introduce SpecMask, a patch-aligned spectrogram augmentation that combines full-frequency masks and localized time-frequency masks under a fixed masking budget, enhancing temporal robustness while preserving spectral continuity. Applied to both AST and AuM, FFTP with SpecMask improves mAP by up to +6.76 on AudioSet-18k and accuracy by up to +8.46 on SpeechCommandsV2, while reducing computation by up to 83.26%, demonstrating both performance and efficiency gains.
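To make the two components concrete, the following NumPy sketch illustrates the general idea: FFTP-style patching that splits a (frequency, time) spectrogram into full-frequency temporal segments, and a SpecMask-style pass that spends a fixed masking budget on one full-frequency time mask plus several localized rectangles. All specifics here (function names, segment length, the budget split `wide_frac`, the mask count `n_local`) are illustrative assumptions, not the paper's actual hyperparameters.

```python
import numpy as np

def fftp_patches(spec, seg_len):
    """FFTP sketch: split a (freq, time) spectrogram along the time axis
    into segments of seg_len frames, each spanning the full frequency
    range, and flatten each segment into one patch vector."""
    n_freq, n_time = spec.shape
    n_seg = n_time // seg_len                        # drop any ragged tail
    segs = spec[:, :n_seg * seg_len].reshape(n_freq, n_seg, seg_len)
    return segs.transpose(1, 0, 2).reshape(n_seg, n_freq * seg_len)

def spec_mask(spec, budget, rng, wide_frac=0.5, n_local=4):
    """SpecMask sketch: zero out a fixed fraction (`budget`) of bins,
    split between one full-frequency time mask and n_local small
    localized time-frequency rectangles. Splits are illustrative."""
    out = spec.copy()
    n_freq, n_time = spec.shape
    total = int(budget * spec.size)
    # (a) full-frequency time mask: contiguous columns over all bins
    wide_cols = int(wide_frac * total) // n_freq
    t0 = rng.integers(0, max(1, n_time - wide_cols + 1))
    out[:, t0:t0 + wide_cols] = 0.0
    # (b) localized square masks sharing the remaining budget
    rect_area = max(1, (total - wide_cols * n_freq) // n_local)
    side = max(1, int(np.sqrt(rect_area)))
    for _ in range(n_local):
        f0 = rng.integers(0, max(1, n_freq - side + 1))
        t0 = rng.integers(0, max(1, n_time - side + 1))
        out[f0:f0 + side, t0:t0 + side] = 0.0
    return out
```

With a 128-band, 1000-frame spectrogram and 25-frame segments, this yields 40 patches of dimension 3200, far fewer than the hundreds of patches square 16×16 patching would produce over the same input, which is the source of the computational savings the paper reports.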