🤖 AI Summary
Weakly interpretable representations and dataset- or task-specific disentanglement remain critical bottlenecks in neural audio codecs. To address this, we propose a time-domain neural codec based on spectral decomposition of the input signal, incorporating a soft frequency-band decoupling mechanism that explicitly models semantic independence across frequency subbands during encoding. Our approach further integrates spectral prior structures to guide representation learning. A differentiable separation loss enables joint time–frequency modeling, substantially improving semantic disentanglement and cross-task generalization. Experiments demonstrate that our model surpasses state-of-the-art baselines in reconstruction fidelity (STOI, PESQ) and perceptual quality (MOS). It exhibits strong robustness and versatility across diverse downstream tasks, including speech enhancement, source separation, and audio synthesis, without task-specific architectural modifications.
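The summary describes decomposing a time-domain signal into frequency subbands before encoding. The paper's actual decoupling mechanism is learned and soft; as a rough intuition for the hard-mask version of such a spectral decomposition, here is a minimal numpy sketch (the function name, band edges, and toy signal are all illustrative, not from the paper):

```python
import numpy as np

def split_subbands(x, sr, edges):
    """Split a time-domain signal into frequency subbands via hard FFT masks.

    edges: band-edge frequencies in Hz, e.g. [0, 1000, sr / 2].
    Returns shape (num_bands, len(x)); the bands sum back to x.
    """
    X = np.fft.rfft(x)
    freqs = np.fft.rfftfreq(len(x), d=1.0 / sr)
    bands = []
    for i, (lo, hi) in enumerate(zip(edges[:-1], edges[1:])):
        last = i == len(edges) - 2
        hi_mask = freqs <= hi if last else freqs < hi  # include Nyquist in last band
        bands.append(np.fft.irfft(X * ((freqs >= lo) & hi_mask), n=len(x)))
    return np.stack(bands)

# Toy signal: a 440 Hz tone plus a quieter 3 kHz tone.
sr = 16000
t = np.arange(sr) / sr
x = np.sin(2 * np.pi * 440 * t) + 0.5 * np.sin(2 * np.pi * 3000 * t)

bands = split_subbands(x, sr, [0, 1000, sr / 2])
# The hard masks form a partition of the spectrum, so the bands sum back to x,
# and each tone lands in its own subband.
assert np.allclose(bands.sum(axis=0), x, atol=1e-6)
```

In the codec itself the split is soft (learned, differentiable masks rather than binary ones), which is what lets the separation loss be trained jointly with the encoder; this sketch only shows the underlying time-to-subband view of the signal.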
📝 Abstract
While neural models have led to significant advances in audio feature extraction, the interpretability of the learned representations remains a critical challenge. To address this, disentanglement techniques have been integrated into discrete neural audio codecs to impose structure on the extracted tokens. However, these approaches often exhibit strong dependencies on specific datasets or task formulations. In this work, we propose a disentangled neural audio codec that leverages spectral decomposition of time-domain signals to enhance representation interpretability. Experimental evaluations demonstrate that our method surpasses a state-of-the-art baseline in both reconstruction fidelity and perceptual quality.