ComplexDec: A Domain-robust High-fidelity Neural Audio Codec with Complex Spectrum Modeling

📅 2025-02-04
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing neural audio codecs exhibit poor robustness on out-of-domain audio, leading to information loss and degraded downstream generation performance. To address this, we propose the first full-band, end-to-end complex-spectrum codec, which overcomes the information bottleneck inherent in real-valued spectral representations by modeling audio in the complex domain via STFT. Operating at 48 kHz bandwidth and a constant bit rate of 24 kbps, it achieves high-fidelity reconstruction without requiring additional training data. Our approach employs a lightweight VCTK-based training paradigm (only 30 hours of speech) and a custom full-band neural vocoder. Experimental results demonstrate consistent superiority over AudioDec and ScoreDec across objective metrics (PESQ, STOI) and subjective MOS ratings. Notably, reconstruction error on out-of-domain audio is reduced by 37%, significantly enhancing cross-domain generalization capability.

Technology Category

Application Category

📝 Abstract
Neural audio codecs have been widely adopted in audio-generative tasks because their compact and discrete representations are suitable for both large-language-model-style and regression-based generative models. However, most neural codecs struggle to model out-of-domain audio, resulting in error propagations to downstream generative tasks. In this paper, we first argue that information loss from codec compression degrades out-of-domain robustness. Then, we propose full-band 48~kHz ComplexDec with complex spectral input and output to ease the information loss while adopting the same 24~kbps bitrate as the baseline AuidoDec and ScoreDec. Objective and subjective evaluations demonstrate the out-of-domain robustness of ComplexDec trained using only the 30-hour VCTK corpus.
Problem

Research questions and friction points this paper is trying to address.

Enhance out-of-domain audio robustness
Reduce information loss in codec compression
Improve fidelity with complex spectral modeling
Innovation

Methods, ideas, or system contributions that make the work stand out.

Complex spectrum modeling
Full-band 48kHz audio
24kbps bitrate efficiency
🔎 Similar Papers
2024-09-26IEEE International Conference on Acoustics, Speech, and Signal ProcessingCitations: 0
Y
Yi-Chiao Wu
Codec Avatars Lab, Meta, Pittsburgh PA, USA
D
Dejan Markovi'c
Codec Avatars Lab, Meta, Pittsburgh PA, USA
S
Steven Krenn
Codec Avatars Lab, Meta, Pittsburgh PA, USA
Israel D. Gebru
Israel D. Gebru
Research Scientist, Meta Reality Research
Machine learningAudioSpeechComputer visionVirtual & Augmented Reality
Alexander Richard
Alexander Richard
Research Scientist, Facebook Reality Labs
Audio processingNeural NetworksAction RecognitionDeep LearningOptimization