🤖 AI Summary
Pretrained audio models exhibit limited interpretability in their latent representations, and conventional linear probing methods fail to uncover fine-grained phonetic attributes and underlying acoustic factors.
Method: This work introduces sparse autoencoders (SAEs) into the representational analysis of audio foundation models for the first time, using singing technique classification as a probing task to achieve disentangled interpretation of vocal attributes. SAEs learn sparse, semantically meaningful feature bases in the latent space; interpretability and fidelity are jointly validated via linear probing and downstream classification.
Results: Experiments demonstrate that SAEs preserve the fidelity and discriminative capability of the original representations while significantly enhancing feature disentanglement. They identify critical acoustic and technical factors, such as vibrato intensity and laryngeal height, that govern representation learning. This establishes a novel paradigm and reusable toolkit for interpretability research in self-supervised audio models.
📝 Abstract
Audio pretrained models are widely employed to solve various tasks in speech processing, sound event detection, and music information retrieval. However, the representations learned by these models remain unclear, and their analysis is mainly restricted to linear probing of the hidden representations. In this work, we explore the use of Sparse Autoencoders (SAEs) to analyze the hidden representations of pretrained models, focusing on a case study in singing technique classification. We first demonstrate that SAEs retain information about both the original representations and the class labels, enabling their internal structure to provide insights into self-supervised learning systems. Furthermore, we show that SAEs enhance the disentanglement of vocal attributes, establishing them as an effective tool for identifying the underlying factors encoded in the representations.
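To make the core idea concrete, below is a minimal sketch of the kind of sparse autoencoder the abstract describes: an overcomplete ReLU encoder with an L1 sparsity penalty, trained to reconstruct frozen hidden representations. This is not the paper's implementation; all dimensions, learning rates, and the use of random data in place of real audio embeddings are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: d = pretrained-model embedding dim, k = SAE dictionary
# size (overcomplete, k > d, so features can specialize and stay sparse).
d, k = 16, 64
X = rng.standard_normal((256, d))  # stand-in for frozen hidden representations

# Encoder (W_e, b_e) and decoder (W_d, b_d) parameters.
W_e = rng.standard_normal((d, k)) * 0.1
b_e = np.zeros(k)
W_d = rng.standard_normal((k, d)) * 0.1
b_d = np.zeros(d)

def encode(x):
    # ReLU encoder yields non-negative, mostly-zero feature activations.
    return np.maximum(x @ W_e + b_e, 0.0)

def decode(z):
    return z @ W_d + b_d

mse0 = ((decode(encode(X)) - X) ** 2).mean()  # reconstruction error before training

lr, l1 = 0.05, 1e-3  # illustrative hyperparameters
for step in range(1000):
    z = encode(X)
    err = decode(z) - X                         # reconstruction residual
    # Full-batch gradients of  mean ||x_hat - x||^2 + l1 * |z|.
    g_z = (err @ W_d.T + l1 * np.sign(z)) * (z > 0)  # ReLU gate
    W_d -= lr * (z.T @ err) / len(X)
    b_d -= lr * err.mean(0)
    W_e -= lr * (X.T @ g_z) / len(X)
    b_e -= lr * g_z.mean(0)

z = encode(X)
mse1 = ((decode(z) - X) ** 2).mean()
sparsity = (z > 0).mean()  # fraction of active features per input
```

After training, each active dictionary feature can be inspected individually, e.g. by linear-probing the sparse codes `z` against class labels (here, singing techniques) to see which features track attributes such as vibrato.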