AI Summary
This work addresses fine-grained audio classification across hundreds of categories. Methodologically, it proposes a hybrid embedding framework that synergistically integrates domain-informed acoustic features (e.g., pitch, timbre) with end-to-end neural representations. A multi-branch architecture separately extracts handcrafted features and deep embeddings, followed by joint fusion at both feature and representation levels; contrastive learning is further employed to optimize cross-modal alignment. The key contribution is the first empirical demonstration that judicious integration of handcrafted features can significantly outperform state-of-the-art end-to-end models (e.g., CLAP), challenging the prevailing assumption that end-to-end learning inherently surpasses feature engineering. Evaluated on multiple benchmark datasets, the method achieves higher accuracy with fewer parameters, while exhibiting superior robustness and generalization. It establishes a new paradigm for audio representation learning, one that is both high-performing and interpretable.
Abstract
With the advent of modern AI architectures, the field has shifted towards end-to-end models. As a result, neural architectures are now trained without domain-specific biases or knowledge and are optimized purely for the task. In this paper, we learn audio embeddings from diverse, domain-specific feature representations. For audio classification over hundreds of sound categories, we learn separate robust embeddings for diverse audio properties such as pitch, timbre, and neural representation, and we also learn an embedding with an end-to-end architecture. We observe that handcrafted embeddings, e.g., pitch- and timbre-based ones, cannot beat a fully end-to-end representation on their own, yet combining them with the end-to-end embedding significantly improves performance. This work paves the way for bringing domain expertise into end-to-end models to learn robust, diverse representations that surpass the performance of training end-to-end models alone.
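To make the idea of combining handcrafted and end-to-end embeddings concrete, the following is a minimal sketch of one possible fusion scheme: each branch embedding is L2-normalized, the branches are concatenated, and a linear classifier scores the fused vector. This is an illustration under stated assumptions, not the authors' implementation. The branch names, embedding sizes, random toy values, and the helper functions `l2_normalize`, `fuse_embeddings`, and `classify` are all hypothetical, and the abstract does not specify how the embeddings are actually combined.

```python
import math
import random


def l2_normalize(vec):
    """Scale a vector to unit length so no branch dominates by magnitude."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def fuse_embeddings(branches):
    """Concatenate per-branch embeddings after normalizing each one.

    Assumption: simple concatenation; the paper may fuse differently.
    """
    fused = []
    for name in sorted(branches):  # fixed order keeps the layout stable
        fused.extend(l2_normalize(branches[name]))
    return fused


def softmax(scores):
    """Numerically stable softmax over a list of class scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify(fused, weights, biases):
    """Linear classifier over the fused embedding (one weight row per class)."""
    scores = [
        sum(w * x for w, x in zip(row, fused)) + b
        for row, b in zip(weights, biases)
    ]
    return softmax(scores)


# Hypothetical toy inputs: random vectors stand in for learned embeddings.
rng = random.Random(0)
branches = {
    "pitch": [rng.gauss(0, 1) for _ in range(4)],
    "timbre": [rng.gauss(0, 1) for _ in range(6)],
    "end_to_end": [rng.gauss(0, 1) for _ in range(8)],
}
fused = fuse_embeddings(branches)

num_classes = 5  # the paper targets hundreds of classes; 5 keeps the toy small
weights = [[rng.gauss(0, 0.1) for _ in fused] for _ in range(num_classes)]
biases = [0.0] * num_classes

probs = classify(fused, weights, biases)
predicted = max(range(num_classes), key=probs.__getitem__)
```

In a real system the weights would be trained rather than sampled, and the per-branch vectors would come from learned pitch, timbre, and end-to-end encoders. Normalizing each branch before concatenation is one common design choice that keeps branches with larger raw magnitudes from dominating the fused representation.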