🤖 AI Summary
Existing neural audio codecs exhibit poor robustness on out-of-domain audio, leading to information loss and degraded downstream generation performance. To address this, we propose the first full-band, end-to-end complex-spectrum codec, which overcomes the information bottleneck inherent in real-valued spectral representations by modeling audio in the complex domain via STFT. Operating on full-band 48 kHz audio at a constant bit rate of 24 kbps, it achieves high-fidelity reconstruction without requiring additional training data. Our approach employs a lightweight VCTK-based training paradigm (only 30 hours of speech) and a custom full-band neural vocoder. Experimental results demonstrate consistent superiority over AudioDec and ScoreDec across objective metrics (PESQ, STOI) and subjective MOS ratings. Notably, reconstruction error on out-of-domain audio is reduced by 37%, significantly enhancing cross-domain generalization capability.
📝 Abstract
Neural audio codecs have been widely adopted in audio-generative tasks because their compact, discrete representations are suitable for both large-language-model-style and regression-based generative models. However, most neural codecs struggle to model out-of-domain audio, and the resulting errors propagate to downstream generative tasks. In this paper, we first argue that information loss from codec compression degrades out-of-domain robustness. We then propose ComplexDec, a full-band 48 kHz codec with complex spectral input and output that reduces this information loss while keeping the same 24 kbps bitrate as the AudioDec and ScoreDec baselines. Objective and subjective evaluations demonstrate the out-of-domain robustness of ComplexDec trained using only the 30-hour VCTK corpus.
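
To make the idea of complex spectral input and output more concrete, the following is a minimal sketch of why a complex spectrum carries information that a real-valued magnitude spectrum does not. It is not ComplexDec's pipeline: it applies a naive DFT to a single frame (a real codec would use a windowed STFT over many frames), and the test signal, frame length, and function names are all assumptions chosen for illustration.

```python
import cmath
import math


def dft(frame):
    """Naive discrete Fourier transform of one real-valued frame."""
    n = len(frame)
    return [
        sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def idft(spectrum):
    """Inverse DFT; returns the real part of each reconstructed sample."""
    n = len(spectrum)
    return [
        (sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n).real
        for t in range(n)
    ]


def max_abs_error(a, b):
    """Largest sample-wise absolute difference between two frames."""
    return max(abs(x - y) for x, y in zip(a, b))


# Hypothetical test frame: two sinusoids with non-zero phase offsets.
N = 64
frame = [
    math.sin(2 * math.pi * 3 * t / N + 0.7)
    + 0.5 * math.sin(2 * math.pi * 7 * t / N + 1.9)
    for t in range(N)
]

complex_spec = dft(frame)                        # real and imaginary parts kept
magnitude_only = [abs(c) for c in complex_spec]  # phase discarded

# Inverting the complex spectrum recovers the frame up to rounding error,
# while inverting the magnitude alone (zero phase) does not.
err_complex = max_abs_error(frame, idft(complex_spec))
err_magnitude = max_abs_error(frame, idft(magnitude_only))

print(f"complex spectrum error:   {err_complex:.2e}")
print(f"magnitude-only error:     {err_magnitude:.2e}")
```

In this toy setting the complex spectrum is exactly invertible, whereas the magnitude-only representation has lost the phase and cannot reproduce the waveform. This only illustrates the representation itself; it says nothing about how a learned encoder, quantizer, or decoder behaves at 24 kbps.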