🤖 AI Summary
SoundStream, the established neural audio codec that this work builds on, is limited to 24 kHz mono audio, which restricts its use for high-fidelity, high-sample-rate, multi-channel compression. To address this, we propose SpectroStream, an end-to-end neural audio codec that supports full-band (48 kHz) stereo input. SpectroStream combines an audio representation in the time-frequency domain with a delayed-fusion strategy for multi-channel audio: the time-frequency representation improves audio quality, especially at higher sample rates, while delayed fusion balances per-channel acoustic quality against cross-channel phase consistency. SpectroStream achieves high-quality reconstruction of 48 kHz stereo music at bitrates of 4–16 kbps. It thereby combines high fidelity, low bitrate, full bandwidth, and multi-channel support in a single end-to-end architecture.
📝 Abstract
We propose SpectroStream, a full-band multi-channel neural audio codec. As the successor to the well-established SoundStream, SpectroStream extends its capability beyond 24 kHz monophonic audio and enables high-quality reconstruction of 48 kHz stereo music at bit rates of 4–16 kbps. This is accomplished with a new neural architecture that leverages an audio representation in the time-frequency domain, which leads to better audio quality, especially at higher sample rates. The model also uses a delayed-fusion strategy to handle multi-channel audio, which is crucial for balancing per-channel acoustic quality and cross-channel phase consistency.
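To make the two ideas in the abstract concrete, the following is a minimal NumPy sketch, not the SpectroStream architecture itself. It assumes that "time-frequency representation" can be illustrated with a plain short-time Fourier transform and that "delayed fusion" can be illustrated by processing each channel separately and concatenating the results at a late stage. The frame length, hop size, and the functions `per_channel_features` and `delayed_fusion` are hypothetical choices made for illustration only; the abstract does not specify any of them.

```python
import numpy as np

SAMPLE_RATE = 48_000  # full-band sample rate named in the abstract


def stft(signal, frame_len=960, hop=480):
    """Short-time Fourier transform of a 1-D signal with a Hann window.

    frame_len and hop are illustrative (20 ms frames, 10 ms hop at 48 kHz),
    not parameters taken from the paper.
    """
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack(
        [signal[i * hop : i * hop + frame_len] * window for i in range(n_frames)]
    )
    return np.fft.rfft(frames, axis=-1)  # shape: (n_frames, frame_len // 2 + 1)


def per_channel_features(spec):
    """Hypothetical stand-in for a learned per-channel encoder stage."""
    # Stack real and imaginary parts so that phase information is preserved.
    return np.concatenate([spec.real, spec.imag], axis=-1)


def delayed_fusion(stereo):
    """Process each channel on its own, then fuse the channels at a late stage."""
    feats = [per_channel_features(stft(channel)) for channel in stereo]
    # Fusion by concatenation is an assumption made for this sketch.
    return np.concatenate(feats, axis=-1)


# One second of synthetic stereo audio: a 440 Hz tone, phase-shifted on the right.
t = np.arange(SAMPLE_RATE) / SAMPLE_RATE
stereo = np.stack([np.sin(2 * np.pi * 440 * t), np.sin(2 * np.pi * 440 * t + 0.5)])

fused = delayed_fusion(stereo)
print(fused.shape)  # (99, 1924)
```

In this sketch each channel keeps its own spectrogram until the final concatenation, which is one simple way to picture the trade-off the abstract describes: per-channel processing protects the quality of each channel, while the fused representation gives later stages access to the phase relationship between channels.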