🤖 AI Summary
Neural audio codecs often rely on data- or task-specific priors to disentangle frequency-band features, resulting in poor interpretability and limited generalizability. Method: We propose a generic soft disentanglement representation learning framework. It first applies spectral decomposition to project time-domain audio into orthogonal frequency-band subspaces; then employs a multi-branch encoder to model each band independently, jointly optimized via reconstruction and perceptual losses. Crucially, the framework imposes no assumptions about task structure or data distribution, enabling task-agnostic, soft intra-band semantic disentanglement. Contribution/Results: Experiments demonstrate significant improvements over state-of-the-art codecs in objective audio quality metrics (e.g., PESQ, STOI) and perceptual fidelity. Moreover, the learned representations exhibit strong generalization to downstream tasks—such as audio inpainting—and provide interpretable, structured frequency-band semantics without architectural or prior constraints.
📝 Abstract
In neural-based audio feature extraction, ensuring that representations capture disentangled information is crucial for model interpretability. However, existing disentanglement methods often rely on assumptions that are highly dependent on data characteristics or specific tasks. In this work, we introduce a generalizable approach for learning disentangled features within a neural architecture. Our method applies spectral decomposition to time-domain signals, followed by a multi-branch audio codec that operates on the decomposed components. Empirical evaluations demonstrate that our approach achieves better reconstruction and perceptual performance compared to a state-of-the-art baseline while also offering potential advantages for inpainting tasks.