Diverse Audio Embeddings -- Bringing Features Back Outperforms CLAP!

📅 2023-09-15

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses fine-grained audio classification across hundreds of categories. It proposes a hybrid embedding framework that integrates domain-informed acoustic features (e.g., pitch, timbre) with end-to-end neural representations. A multi-branch architecture extracts handcrafted features and deep embeddings separately, then fuses them at both the feature and representation levels; contrastive learning is further employed to optimize cross-modal alignment. The key contribution is the first empirical demonstration that judicious integration of handcrafted features can significantly outperform state-of-the-art end-to-end models (e.g., CLAP), challenging the prevailing assumption that end-to-end learning inherently surpasses feature engineering. Evaluated on multiple benchmark datasets, the method achieves higher accuracy with fewer parameters while exhibiting superior robustness and generalization. It establishes a new paradigm for audio representation learning that is both high-performing and interpretable.
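To make "domain-informed acoustic features" concrete, here is a minimal sketch of one such handcrafted feature: a pitch estimate taken from the peak of a frame's autocorrelation. It only illustrates the general idea and is not the paper's extraction pipeline; the function name `estimate_pitch_autocorr`, the 50 to 1000 Hz search range, and the synthetic 200 Hz test tone are all assumptions made for this example.

```python
import math
from typing import List, Sequence


def estimate_pitch_autocorr(
    frame: Sequence[float],
    sample_rate: int,
    fmin: float = 50.0,    # assumed lower bound of the pitch search range
    fmax: float = 1000.0,  # assumed upper bound of the pitch search range
) -> float:
    """Return a pitch estimate in Hz from the autocorrelation peak.

    Hypothetical helper for illustration only.
    """
    n = len(frame)
    min_lag = max(1, int(sample_rate / fmax))
    max_lag = min(n - 1, int(sample_rate / fmin))
    best_lag, best_value = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        # Unnormalized autocorrelation at this lag. Fewer terms are summed
        # at larger lags, which favors the fundamental period over its
        # integer multiples.
        value = sum(frame[i] * frame[i + lag] for i in range(n - lag))
        if value > best_value:
            best_lag, best_value = lag, value
    return sample_rate / best_lag


# Synthetic 200 Hz sine sampled at 8 kHz (assumed test signal).
SAMPLE_RATE = 8000
tone: List[float] = [
    math.sin(2.0 * math.pi * 200.0 * i / SAMPLE_RATE) for i in range(2048)
]
pitch_hz = estimate_pitch_autocorr(tone, SAMPLE_RATE)
```

A per-frame value like this is a raw feature rather than an embedding; in a multi-branch setup, a sequence of such values would typically be fed to that branch's own model.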

📝 Abstract

With the advent of modern AI architectures, the field has shifted towards end-to-end architectures. This pivot has led to neural architectures being trained without domain-specific biases or knowledge and optimized only for the task. In this paper, we learn audio embeddings via diverse feature representations, in this case domain-specific ones. For audio classification over hundreds of sound categories, we learn robust, separate embeddings for diverse audio properties such as pitch and timbre, as well as a neural representation, alongside an embedding learned via an end-to-end architecture. We observe that handcrafted embeddings, e.g., pitch- and timbre-based ones, cannot beat a fully end-to-end representation on their own, yet combining them with the end-to-end embedding significantly improves performance. This work paves the way for bringing domain expertise into end-to-end models to learn robust, diverse representations that surpass the performance of training end-to-end models alone.
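The idea of combining separately learned embeddings with an end-to-end embedding can be sketched as a simple fusion step. The abstract does not say how the embeddings are combined, so this sketch assumes plain concatenation after per-branch L2 normalization. The branch names (`pitch`, `timbre`, `end_to_end`), the toy vectors, and the helper functions are hypothetical.

```python
import math
from typing import Dict, List, Sequence


def l2_normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit length so that no branch dominates by magnitude."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return [0.0 for _ in vec]
    return [x / norm for x in vec]


def fuse_embeddings(
    branches: Dict[str, Sequence[float]], order: Sequence[str]
) -> List[float]:
    """Concatenate normalized per-branch embeddings in a fixed order.

    Concatenation is an assumption made for this sketch, not a fusion
    rule confirmed by the source.
    """
    fused: List[float] = []
    for name in order:
        fused.extend(l2_normalize(branches[name]))
    return fused


# Toy embeddings standing in for the outputs of three separate branches.
example = {
    "pitch": [3.0, 4.0],
    "timbre": [0.0, 2.0, 0.0],
    "end_to_end": [1.0, 1.0, 1.0, 1.0],
}
fused = fuse_embeddings(example, ["pitch", "timbre", "end_to_end"])
```

The fused vector would then be passed to a downstream classifier. Normalizing each branch first is one common way to keep a high-magnitude embedding from swamping the others, but other weighting schemes are equally plausible.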

Problem

Research questions and friction points this paper is trying to address.

Learning audio embeddings via diverse feature representations

Improving audio classification with domain-specific embeddings

Combining handcrafted and end-to-end embeddings for better performance

Innovation

Methods, ideas, or system contributions that make the work stand out.

Combining diverse audio feature embeddings

Integrating domain-specific and end-to-end embeddings

Outperforming pure end-to-end models with hybrid approach

🔎 Similar Papers

Compositional Audio Representation Learning