🤖 AI Summary
This work addresses two limitations of existing self-supervised audio understanding methods: they are often constrained to single-granularity modeling, which hampers their ability to capture the multi-scale time-frequency structures inherent in complex audio signals, and they suffer from high training costs and slow convergence. To overcome these challenges, we propose the Convolutional Audio Transformer (CAT), a framework that integrates multi-resolution feature fusion with representation regularization based on a frozen pre-trained encoder. CAT leverages a multi-granular convolutional Transformer to extract hierarchical audio representations and incorporates semantic guidance from an external pre-trained model to enable efficient self-supervised learning. Experiments demonstrate that our approach significantly outperforms current methods on benchmarks such as AudioSet-20k, achieving competitive performance with fivefold faster convergence.
📝 Abstract
Bootstrap-based Self-Supervised Learning (SSL) has achieved remarkable progress in audio understanding. However, existing methods typically operate at a single level of granularity, limiting their ability to model the diverse temporal and spectral structures inherent in complex audio signals. Furthermore, bootstrapping representations from scratch is computationally expensive, often requiring extensive training to converge. In this work, we propose the Convolutional Audio Transformer (CAT), a unified framework designed to address these challenges. First, to capture hierarchical audio features, CAT incorporates a Multi-resolution Block that aggregates information across varying granularities. Second, to improve training efficiency, we introduce a Representation Regularization objective. Drawing inspiration from generative modeling, this auxiliary task guides the student model by aligning its predictions with high-quality semantic representations from frozen, pre-trained external encoders. Experimental results demonstrate that CAT significantly outperforms baselines on audio understanding benchmarks. Notably, it achieves competitive performance on the AudioSet-20k dataset with 5x faster convergence than existing methods. Code and checkpoints will be released soon at https://github.com/realzhouchushu/CAT.
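The abstract does not specify the exact form of the Representation Regularization loss. As a rough illustration of the general idea (aligning student predictions with features from a frozen pre-trained encoder), the sketch below assumes a cosine-similarity alignment loss, a common choice in bootstrap-style SSL; the paper's actual objective may differ.

```python
import numpy as np

def representation_regularization_loss(student_pred: np.ndarray,
                                       teacher_feat: np.ndarray) -> float:
    """Negative mean cosine similarity between student predictions and
    frozen-teacher features (an assumed form, not the paper's exact loss).

    student_pred: (batch, dim) outputs of the student's prediction head.
    teacher_feat: (batch, dim) features from a frozen pre-trained encoder
                  (no gradients flow through the teacher).
    """
    # L2-normalize both sets of vectors along the feature dimension.
    s = student_pred / np.linalg.norm(student_pred, axis=-1, keepdims=True)
    t = teacher_feat / np.linalg.norm(teacher_feat, axis=-1, keepdims=True)
    # Minimizing this loss pulls each student prediction toward the
    # corresponding teacher feature direction.
    return float(-np.mean(np.sum(s * t, axis=-1)))
```

When the student's predictions exactly match the teacher's features, the loss reaches its minimum of -1.0; uncorrelated directions give a loss near 0.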