Transformation of audio embeddings into interpretable, concept-based representations

📅 2025-04-18
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Audio neural network embeddings lack semantic interpretability, hindering human understanding and trust. Method: This paper proposes a post-hoc framework that maps CLAP audio embeddings into sparse, concept-based semantic representations. It introduces three open-source, domain-specific audio concept lexicons; integrates sparse coding with concept alignment to achieve unsupervised mapping into an interpretable concept space; and supports end-to-end fine-tuning to jointly optimize interpretability and downstream task performance. Contribution/Results: Experiments demonstrate that the resulting concept representations match or surpass the original CLAP embeddings in audio classification and retrieval tasks. Quantitative evaluation—including concept coverage and faithfulness—and qualitative analysis confirm substantial gains in interpretability, establishing a new benchmark for explainable audio representation learning.

📝 Abstract
Advancements in audio neural networks have established state-of-the-art results on downstream audio tasks. However, the black-box structure of these models makes it difficult to interpret the information encoded in their internal audio representations. In this work, we explore the semantic interpretability of audio embeddings extracted from these neural networks by leveraging CLAP, a contrastive learning model that brings audio and text into a shared embedding space. We implement a post-hoc method to transform CLAP embeddings into concept-based, sparse representations with semantic interpretability. Qualitative and quantitative evaluations show that the concept-based representations outperform or match the performance of original audio embeddings on downstream tasks while providing interpretability. Additionally, we demonstrate that fine-tuning the concept-based representations can further improve their performance on downstream tasks. Lastly, we publish three audio-specific vocabularies for concept-based interpretability of audio embeddings.
Problem

Research questions and friction points this paper is trying to address.

Transform audio embeddings into interpretable concept-based representations
Improve interpretability of black-box audio neural network models
Enhance performance of audio embeddings on downstream tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Transforms audio embeddings into concept-based representations
Uses CLAP for shared audio-text embedding space
Fine-tunes concept-based representations for better performance
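The mapping the bullets above describe — expressing a CLAP audio embedding as a sparse combination of concept vectors — can be sketched with off-the-shelf sparse coding. Everything below is illustrative: the dictionary `D` is random stand-in data (in the paper it would be CLAP text embeddings of lexicon concepts), and the paper's actual alignment objective and solver are not specified here; orthogonal matching pursuit is just one common way to enforce sparsity.

```python
import numpy as np
from sklearn.linear_model import OrthogonalMatchingPursuit

# Illustrative stand-ins: in the paper, each row of D would be the CLAP
# text embedding of one concept word from a lexicon, and x a CLAP audio
# embedding. Here both are synthetic.
rng = np.random.default_rng(0)
n_concepts, dim = 50, 512
D = rng.normal(size=(n_concepts, dim))
D /= np.linalg.norm(D, axis=1, keepdims=True)  # unit-norm concept vectors

# Simulate an audio embedding dominated by two known concepts plus noise.
x = 0.6 * D[3] + 0.3 * D[10] + 0.05 * rng.normal(size=dim)
x /= np.linalg.norm(x)

# Sparse coding: find a weight vector w with few nonzeros such that
# D.T @ w approximates x; the nonzero entries name the active concepts.
omp = OrthogonalMatchingPursuit(n_nonzero_coefs=5)
omp.fit(D.T, x)          # columns of D.T are the dictionary atoms
w = omp.coef_            # sparse, concept-indexed representation of x

active = np.flatnonzero(w)
print("active concepts:", active)
```

For the end-to-end fine-tuning the summary mentions, a differentiable relaxation of this step (e.g., an L1-penalized linear layer) would be one plausible choice, since a greedy solver like OMP is not directly differentiable.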