Transformation of audio embeddings into interpretable, concept-based representations

📅 2025-04-18
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Audio neural network embeddings lack semantic interpretability, hindering human understanding and trust. Method: This paper proposes a post-hoc framework that maps CLAP audio embeddings into sparse, concept-based semantic representations. It introduces three open-source, domain-specific audio concept lexicons; integrates sparse coding with concept alignment to achieve unsupervised mapping into an interpretable concept space; and supports end-to-end fine-tuning to jointly optimize interpretability and downstream task performance. Contribution/Results: Experiments demonstrate that the resulting concept representations match or surpass the original CLAP embeddings in audio classification and retrieval tasks. Quantitative evaluation—including concept coverage and faithfulness—and qualitative analysis confirm substantial gains in interpretability, establishing a new benchmark for explainable audio representation learning.

📝 Abstract
Advancements in audio neural networks have established state-of-the-art results on downstream audio tasks. However, the black-box structure of these models makes it difficult to interpret the information encoded in their internal audio representations. In this work, we explore the semantic interpretability of audio embeddings extracted from these neural networks by leveraging CLAP, a contrastive learning model that brings audio and text into a shared embedding space. We implement a post-hoc method to transform CLAP embeddings into concept-based, sparse representations with semantic interpretability. Qualitative and quantitative evaluations show that the concept-based representations outperform or match the performance of original audio embeddings on downstream tasks while providing interpretability. Additionally, we demonstrate that fine-tuning the concept-based representations can further improve their performance on downstream tasks. Lastly, we publish three audio-specific vocabularies for concept-based interpretability of audio embeddings.
Problem

Research questions and friction points this paper is trying to address.

Transform audio embeddings into interpretable concept-based representations
Improve interpretability of black-box audio neural network models
Enhance performance of audio embeddings on downstream tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Transforms audio embeddings into concept-based representations
Uses CLAP for shared audio-text embedding space
Fine-tunes concept-based representations for better performance
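The mapping the bullets above describe — expressing a CLAP audio embedding as a sparse combination of concept vectors — can be sketched with off-the-shelf sparse coding. Everything below is illustrative: the dictionary `D` is random stand-in data (in the paper it would be CLAP text embeddings of lexicon concepts), and the paper's actual alignment objective and solver are not specified here; orthogonal matching pursuit is just one common way to enforce sparsity.

```python
import numpy as np
from sklearn.linear_model import OrthogonalMatchingPursuit

# Illustrative stand-ins: in the paper, each row of D would be the CLAP
# text embedding of one concept word from a lexicon, and x a CLAP audio
# embedding. Here both are synthetic.
rng = np.random.default_rng(0)
n_concepts, dim = 50, 512
D = rng.normal(size=(n_concepts, dim))
D /= np.linalg.norm(D, axis=1, keepdims=True)  # unit-norm concept vectors

# Simulate an audio embedding dominated by two known concepts plus noise.
x = 0.6 * D[3] + 0.3 * D[10] + 0.05 * rng.normal(size=dim)
x /= np.linalg.norm(x)

# Sparse coding: find a weight vector w with few nonzeros such that
# D.T @ w approximates x; the nonzero entries name the active concepts.
omp = OrthogonalMatchingPursuit(n_nonzero_coefs=5)
omp.fit(D.T, x)          # columns of D.T are the dictionary atoms
w = omp.coef_            # sparse, concept-indexed representation of x

active = np.flatnonzero(w)
print("active concepts:", active)
```

For the end-to-end fine-tuning the summary mentions, a differentiable relaxation of this step (e.g., an L1-penalized linear layer) would be one plausible choice, since a greedy solver like OMP is not directly differentiable.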