🤖 AI Summary
Existing audio classification models (e.g., AST, Audio Mamba) adopt square spectrogram patching inherited from vision, which disrupts frequency continuity and produces redundant patches, leading to high computational overhead and inefficient training. To address this, we propose Full-Frequency Temporal Patching (FFTP) and Structured Spectrogram Masking (SpecMask). FFTP partitions the spectrogram along the time axis into segments that each span the full frequency range, preserving harmonic structure while drastically reducing the number of patches. SpecMask applies structured masking in the time–frequency domain under a fixed budget, enhancing temporal robustness while preserving spectral continuity. Both methods are architecture-agnostic and integrate seamlessly into Transformer and State Space Model (SSM) backbones. On AudioSet-18k, our approach improves mean average precision (mAP) by up to 6.76 points; on SpeechCommandsV2, it raises accuracy by up to 8.46 percentage points; and it cuts computation by up to 83.26%. These results demonstrate joint gains in both performance and efficiency.
📝 Abstract
Transformers and State-Space Models (SSMs) have advanced audio classification by modeling spectrograms as sequences of patches. However, existing models such as the Audio Spectrogram Transformer (AST) and Audio Mamba (AuM) adopt square patching from computer vision, which disrupts continuous frequency patterns and produces an excessive number of patches, slowing training and increasing computation. We propose Full-Frequency Temporal Patching (FFTP), a patching strategy that better matches the time-frequency asymmetry of spectrograms: each patch spans the full frequency band over a localized temporal window, preserving harmonic structure while significantly reducing patch count and computation. We also introduce SpecMask, a patch-aligned spectrogram augmentation that combines full-frequency masks and localized time-frequency masks under a fixed masking budget, enhancing temporal robustness while preserving spectral continuity. Applied to both AST and AuM, FFTP with SpecMask improves mAP by up to +6.76 on AudioSet-18k and accuracy by up to +8.46 on SpeechCommandsV2, while reducing computation by up to 83.26%, demonstrating both performance and efficiency gains.
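To make the two components concrete, the following NumPy sketch illustrates the general idea: FFTP-style patching that splits a (frequency, time) spectrogram into full-frequency temporal segments, and a SpecMask-style pass that spends a fixed masking budget on one full-frequency time mask plus several localized rectangles. All specifics here (function names, segment length, the budget split `wide_frac`, the mask count `n_local`) are illustrative assumptions, not the paper's actual hyperparameters.

```python
import numpy as np

def fftp_patches(spec, seg_len):
    """FFTP sketch: split a (freq, time) spectrogram along the time axis
    into segments of seg_len frames, each spanning the full frequency
    range, and flatten each segment into one patch vector."""
    n_freq, n_time = spec.shape
    n_seg = n_time // seg_len                        # drop any ragged tail
    segs = spec[:, :n_seg * seg_len].reshape(n_freq, n_seg, seg_len)
    return segs.transpose(1, 0, 2).reshape(n_seg, n_freq * seg_len)

def spec_mask(spec, budget, rng, wide_frac=0.5, n_local=4):
    """SpecMask sketch: zero out a fixed fraction (`budget`) of bins,
    split between one full-frequency time mask and n_local small
    localized time-frequency rectangles. Splits are illustrative."""
    out = spec.copy()
    n_freq, n_time = spec.shape
    total = int(budget * spec.size)
    # (a) full-frequency time mask: contiguous columns over all bins
    wide_cols = int(wide_frac * total) // n_freq
    t0 = rng.integers(0, max(1, n_time - wide_cols + 1))
    out[:, t0:t0 + wide_cols] = 0.0
    # (b) localized square masks sharing the remaining budget
    rect_area = max(1, (total - wide_cols * n_freq) // n_local)
    side = max(1, int(np.sqrt(rect_area)))
    for _ in range(n_local):
        f0 = rng.integers(0, max(1, n_freq - side + 1))
        t0 = rng.integers(0, max(1, n_time - side + 1))
        out[f0:f0 + side, t0:t0 + side] = 0.0
    return out
```

With a 128-band, 1000-frame spectrogram and 25-frame segments, this yields 40 patches of dimension 3200, far fewer than the hundreds of patches square 16×16 patching would produce over the same input, which is the source of the computational savings the paper reports.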