Sparse Autoencoders Make Audio Foundation Models more Explainable

📅 2025-09-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
Pretrained audio models exhibit limited interpretability in their latent representations, and conventional linear probing fails to uncover fine-grained phonetic attributes and underlying acoustic factors. Method: this work introduces sparse autoencoders (SAEs) into the representational analysis of audio foundation models, using singing technique classification as a probing task to obtain a disentangled interpretation of vocal attributes. SAEs learn sparse, semantically meaningful feature bases in the latent space; interpretability and fidelity are jointly validated via linear probing and downstream classification. Results: experiments demonstrate that SAEs preserve the fidelity and discriminative capability of the original representations while significantly enhancing feature disentanglement, and they identify critical acoustic and technical factors, such as vibrato intensity and laryngeal height, that shape the learned representations. This establishes a new paradigm and reusable toolkit for interpretability research on self-supervised audio models.
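The core mechanism the summary describes, an overcomplete encoder with a sparsity penalty that reconstructs a model's hidden states, can be sketched as follows. This is a minimal numpy illustration, not the paper's implementation; the dimensions, the ReLU encoder, and the L1 coefficient are assumptions for the sake of the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: d = hidden size of the audio model, k = overcomplete
# SAE dictionary size (k > d so individual features can specialize).
d, k = 16, 64
W_enc = rng.normal(scale=0.1, size=(d, k))
b_enc = np.zeros(k)
W_dec = rng.normal(scale=0.1, size=(k, d))
b_dec = np.zeros(d)

def sae_forward(h):
    """Encode hidden representations h into sparse codes, then reconstruct."""
    z = np.maximum(h @ W_enc + b_enc, 0.0)  # ReLU -> non-negative, sparse codes
    h_hat = z @ W_dec + b_dec               # linear decoder reconstructs h
    return z, h_hat

h = rng.normal(size=(8, d))                 # a batch of 8 toy embeddings
z, h_hat = sae_forward(h)

# Training objective: reconstruction error plus an L1 penalty on the codes,
# which pushes most code entries toward zero and encourages disentanglement.
l1_coef = 1e-3
loss = np.mean((h - h_hat) ** 2) + l1_coef * np.abs(z).mean()
```

After training, each column of `W_dec` acts as a candidate "feature direction" in the model's latent space; in the paper's setting these are the directions probed for vocal attributes such as vibrato intensity.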

📝 Abstract
Audio pretrained models are widely employed to solve various tasks in speech processing, sound event detection, or music information retrieval. However, the representations learned by these models are unclear, and their analysis is mainly restricted to linear probing of the hidden representations. In this work, we explore the use of Sparse Autoencoders (SAEs) to analyze the hidden representations of pretrained models, focusing on a case study in singing technique classification. We first demonstrate that SAEs retain both information about the original representations and class labels, enabling their internal structure to provide insights into self-supervised learning systems. Furthermore, we show that SAEs enhance the disentanglement of vocal attributes, establishing them as an effective tool for identifying the underlying factors encoded in the representations.
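The linear probing baseline the abstract contrasts against fits a single linear map from frozen embeddings to class labels; probe accuracy then measures how linearly decodable a label is from the representation. A minimal sketch, assuming a closed-form ridge-regression probe on synthetic data (the dimensions, label count, and regularizer are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: 200 frozen frame embeddings (dim 16) with
# 4 singing-technique labels.
n, d, c = 200, 16, 4
X = rng.normal(size=(n, d))          # stand-in for pretrained-model embeddings
y = rng.integers(0, c, size=n)
Y = np.eye(c)[y]                     # one-hot targets

# Ridge-regression probe: one linear map, fit in closed form, no hidden layers.
lam = 1e-2
W = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Y)

pred = (X @ W).argmax(axis=1)
acc = (pred == y).mean()             # probe accuracy = linear decodability
```

Because the probe is purely linear, it can miss attributes that are encoded nonlinearly or spread across many directions, which is the gap the SAE analysis is meant to address.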
Problem

Research questions and friction points this paper is trying to address.

Analyzing unclear representations in audio pretrained models
Enhancing interpretability of self-supervised learning systems
Disentangling vocal attributes in singing technique classification
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses Sparse Autoencoders to analyze hidden representations
Retains original information and class labels effectively
Enhances disentanglement of vocal attributes in audio