Learning Source Disentanglement in Neural Audio Codec

📅 2024-09-17

🏛️ arXiv.org

📈 Citations: 0

✨ Influential: 0

career value

239K/year

🤖 AI Summary

Existing neural audio codecs uniformly model mixed-audio datasets without accounting for semantic disparities among acoustic sources—such as speech, music, and environmental sounds—leading to latent-space ambiguity and poor generation controllability. To address this, we propose the Source-Disentangled Neural Audio Codec (SD-Codec), the first framework that unifies source separation and neural audio coding within a single architecture. SD-Codec employs a domain-aware routing mechanism to map distinct acoustic sources to dedicated discrete codebooks, achieving source-level disentanglement at the codebook level. It jointly optimizes reconstruction loss and separation supervision using a multi-codebook quantization architecture. Experiments demonstrate that SD-Codec maintains state-of-the-art reconstruction quality (STOI: 0.95, PESQ: 3.82) while significantly improving separation performance (SI-SNRi +4.2 dB), thereby validating enhanced latent-space interpretability and generation controllability.

Technology Category

Application Category

📝 Abstract

Neural audio codecs have significantly advanced audio compression by efficiently converting continuous audio signals into discrete tokens. These codecs preserve high-quality sound and enable sophisticated sound generation through generative models trained on these tokens. However, existing neural codec models are typically trained on large, undifferentiated audio datasets, neglecting the essential discrepancies between sound domains like speech, music, and environmental sound effects. This oversight complicates data modeling and poses additional challenges to the controllability of sound generation. To tackle these issues, we introduce the Source-Disentangled Neural Audio Codec (SD-Codec), a novel approach that combines audio coding and source separation. By jointly learning audio resynthesis and separation, SD-Codec explicitly assigns audio signals from different domains to distinct codebooks, sets of discrete representations. Experimental results indicate that SD-Codec not only maintains competitive resynthesis quality but also, supported by the separation results, demonstrates successful disentanglement of different sources in the latent space, thereby enhancing interpretability in audio codec and providing potential finer control over the audio generation process.

Problem

Research questions and friction points this paper is trying to address.

Neural audio codecs overlook domain discrepancies

SD-Codec combines audio coding and source separation

Enhances interpretability and control in audio generation

Innovation

Methods, ideas, or system contributions that make the work stand out.

Source-Disentangled Neural Audio Codec

Joint audio resynthesis and separation

Distinct codebooks for audio domains

🔎 Similar Papers

Compositional Audio Representation Learning