Towards Interpretable Framework for Neural Audio Codecs via Sparse Autoencoders: A Case Study on Accent Information

📅 2026-03-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses the limited interpretability of neural audio codecs (NACs) with respect to paralinguistic information such as accent, a gap that hinders their deployment in sensitive applications. The work uses sparse autoencoders (SAEs) to decompose the dense representations of NACs into sparse activations and builds an interpretability evaluation framework with accent as a case study, quantified by a relative performance index. Four NAC models are evaluated under 16 SAE configurations. DAC and SpeechTokenizer achieve the highest interpretability among the evaluated models. Acoustic-oriented NACs encode accent primarily in activation magnitudes, whereas phonetic-oriented NACs rely more on activation positions. Low-bitrate variants of EnCodec show higher interpretability, suggesting a non-trivial relationship between compression and representational transparency.
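The distinction between activation magnitudes and activation positions can be illustrated with a small sketch. This is not the paper's probing procedure, which the summary does not describe; the helper `split_activation` and the toy vectors are hypothetical and only show how one sparse vector separates into "which latents fired" and "how strongly they fired".

```python
import numpy as np

def split_activation(z):
    """Split a 1-D sparse activation vector into two views (illustrative only).

    mask: 1.0 where a latent is active, 0.0 elsewhere (position information).
    mags: nonzero values sorted in descending order (magnitude information,
          with the information about which latent fired discarded).
    """
    z = np.asarray(z, dtype=np.float32)
    active = z != 0
    mask = active.astype(np.float32)
    mags = np.sort(z[active])[::-1]
    return mask, mags

# Toy vectors (made up for illustration).
a = np.array([0.0, 2.0, 0.0, 0.5])
b = np.array([0.0, 1.0, 0.0, 0.25])  # same positions as a, different magnitudes
c = np.array([2.0, 0.0, 0.5, 0.0])   # same magnitudes as a, different positions

mask_a, mags_a = split_activation(a)
mask_b, mags_b = split_activation(b)
mask_c, mags_c = split_activation(c)
```

Under this toy split, `a` and `b` differ only in the magnitude view, while `a` and `c` differ only in the position view. A codec described as encoding accent "in magnitudes" would, in this picture, separate accents more like the first pair than the second.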

📝 Abstract
Neural Audio Codecs (NACs) are widely adopted in modern speech systems, yet how they encode linguistic and paralinguistic information remains unclear. Improving the interpretability of NAC representations is critical for understanding and deploying them in sensitive applications. Hence, we employ Sparse Autoencoders (SAEs) to decompose dense NAC representations into sparse, interpretable activations. In this work, we focus on a challenging paralinguistic attribute, accent, and propose a framework to quantify NAC interpretability. We evaluate four NAC models under 16 SAE configurations using a relative performance index. Our results show that DAC and SpeechTokenizer achieve the highest interpretability. We further reveal that acoustic-oriented NACs encode accent information primarily in activation magnitudes of sparse representations, whereas phonetic-oriented NACs rely more on activation positions, and that low-bitrate EnCodec variants show higher interpretability.
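As a rough illustration of what decomposing a dense representation into sparse activations can look like, here is a minimal NumPy sketch of a top-k sparse autoencoder forward pass. The abstract does not specify the SAE variants used, so the top-k sparsity rule, the decoder-bias subtraction, the dimensions, and every name below (`topk_sae_forward`, `W_enc`, `k`, and so on) are assumptions for illustration, and the weights are random rather than trained.

```python
import numpy as np

def topk_sae_forward(x, W_enc, b_enc, W_dec, b_dec, k):
    """Forward pass of a top-k sparse autoencoder (illustrative sketch).

    x: (batch, d) dense frames, e.g. stand-ins for codec latents.
    Returns sparse codes z of shape (batch, m) and reconstructions (batch, d).
    """
    # Encoder with ReLU; subtracting b_dec first is one common convention.
    pre = np.maximum((x - b_dec) @ W_enc + b_enc, 0.0)
    # Keep only the k largest pre-activations in each row, zero the rest.
    idx = np.argpartition(pre, -k, axis=1)[:, -k:]
    z = np.zeros_like(pre)
    np.put_along_axis(z, idx, np.take_along_axis(pre, idx, axis=1), axis=1)
    # Linear decoder maps the sparse code back to the dense space.
    x_hat = z @ W_dec + b_dec
    return z, x_hat

# Hypothetical sizes: 16-dim dense input, 64 sparse latents, 4 active per frame.
rng = np.random.default_rng(0)
x = rng.normal(size=(8, 16))
W_enc = rng.normal(size=(16, 64)) * 0.1
W_dec = W_enc.T.copy()  # tied initialisation, an assumption
b_enc = np.zeros(64)
b_dec = np.zeros(16)

z, x_hat = topk_sae_forward(x, W_enc, b_enc, W_dec, b_dec, k=4)
```

Each row of `z` has at most `k` nonzero entries, so a downstream probe can look at which latents are active and at how large they are as two separate sources of information.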
Problem

Research questions and friction points this paper is trying to address.

Neural Audio Codecs
interpretability
accent
paralinguistic information
Sparse Autoencoders
Innovation

Methods, ideas, or system contributions that make the work stand out.

Sparse Autoencoders
Neural Audio Codecs
Interpretability
Accent Representation
Paralinguistic Information
Authors
Shih-Heng Wang
University of Southern California, USA
Tiantian Feng
Postdoc Researcher
Health and Behaviors, Wearable Computing, Affective Computing, Speech and Biosignal, Responsible ML
Aditya Kommineni
University of Southern California
Thanathai Lertpetchpun
University of Southern California, USA
Bowen Yi
Assistant Professor, Polytechnique Montréal
nonlinear systems, robotics
Xuan Shi
University of Southern California, USA
Shrikanth Narayanan
University of Southern California, USA