SpectroStream: A Versatile Neural Codec for General Audio

📅 2025-08-07
🤖 AI Summary
SoundStream, a well-established neural audio codec, is limited to 24 kHz mono audio, which restricts its use for high-fidelity, high-sample-rate, multi-channel compression. SpectroStream, its successor, is a full-band neural audio codec that handles 48 kHz stereo input. It combines two ideas: a new architecture that represents audio in the time-frequency domain, which improves quality especially at higher sample rates, and a delayed-fusion strategy for multi-channel audio, which balances per-channel acoustic quality against cross-channel phase consistency. With these, SpectroStream achieves high-quality reconstruction of 48 kHz stereo music at bit rates of 4–16 kbps.

📝 Abstract
We propose SpectroStream, a full-band multi-channel neural audio codec. As the successor to the well-established SoundStream, SpectroStream extends its capability beyond 24 kHz monophonic audio and enables high-quality reconstruction of 48 kHz stereo music at bit rates of 4-16 kbps. This is accomplished with a new neural architecture that leverages audio representation in the time-frequency domain, which leads to better audio quality, especially at higher sample rates. The model also uses a delayed-fusion strategy to handle multi-channel audio, which is crucial for balancing per-channel acoustic quality and cross-channel phase consistency.
Problem

Research questions and friction points this paper is trying to address.

Extends neural audio coding from 24 kHz mono to 48 kHz stereo music
Improves audio quality, especially at higher sample rates, with a time-frequency-domain architecture
Balances per-channel acoustic quality and cross-channel phase consistency
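
To put the 4-16 kbps target in perspective, the short calculation below compares it with uncompressed PCM at the same sample rate and channel count. This is only a back-of-the-envelope sketch: the 16-bit sample depth is an assumption (the abstract does not state the source bit depth), and the function names are hypothetical.

```python
def pcm_bitrate_kbps(sample_rate_hz: int, channels: int, bits_per_sample: int) -> float:
    """Bit rate of uncompressed PCM audio in kbps (1 kbps = 1000 bit/s)."""
    return sample_rate_hz * channels * bits_per_sample / 1000


def compression_ratio(
    codec_kbps: float,
    sample_rate_hz: int = 48_000,  # full-band sample rate from the abstract
    channels: int = 2,             # stereo
    bits_per_sample: int = 16,     # ASSUMPTION: 16-bit PCM source
) -> float:
    """How many times smaller the coded stream is than the raw PCM stream."""
    return pcm_bitrate_kbps(sample_rate_hz, channels, bits_per_sample) / codec_kbps


if __name__ == "__main__":
    raw = pcm_bitrate_kbps(48_000, 2, 16)
    print(f"raw 48 kHz stereo 16-bit PCM: {raw:.0f} kbps")
    for kbps in (4, 8, 16):
        print(f"{kbps:>2} kbps codec -> {compression_ratio(kbps):.0f}x smaller")
```

Under that 16-bit assumption, raw 48 kHz stereo PCM runs at 1536 kbps, so the quoted range corresponds to roughly a 96x to 384x reduction.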
Innovation

Methods, ideas, or system contributions that make the work stand out.

Full-band multi-channel neural audio codec
Time-frequency domain audio representation
Delayed-fusion for multi-channel handling
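
As a rough illustration of what a time-frequency representation of stereo audio looks like, the sketch below computes a per-channel short-time Fourier transform (STFT) with NumPy. It is not the paper's architecture: the STFT itself, the Hann window, and the `n_fft` and `hop` values are all assumptions chosen for the example, and the listing does not describe the transform or the delayed-fusion mechanism in enough detail to reproduce them.

```python
import numpy as np


def stft(signal: np.ndarray, n_fft: int = 1024, hop: int = 256) -> np.ndarray:
    """Complex STFT of a 1-D signal, shape (n_frames, n_fft // 2 + 1).

    ASSUMPTION: Hann window, no padding; these choices are illustrative only.
    """
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack(
        [signal[i * hop : i * hop + n_fft] * window for i in range(n_frames)]
    )
    return np.fft.rfft(frames, axis=-1)


def stereo_spectrogram(audio: np.ndarray, n_fft: int = 1024, hop: int = 256) -> np.ndarray:
    """Transform each channel separately; audio has shape (channels, n_samples)."""
    return np.stack([stft(channel, n_fft, hop) for channel in audio])


if __name__ == "__main__":
    sample_rate = 48_000
    t = np.arange(sample_rate) / sample_rate      # one second of audio
    left = np.sin(2 * np.pi * 1500 * t)           # 1.5 kHz tone
    right = np.sin(2 * np.pi * 3000 * t)          # 3 kHz tone
    spec = stereo_spectrogram(np.stack([left, right]))
    print(spec.shape)                             # (channels, frames, frequency bins)
```

Each channel is kept as its own time-frequency grid here, so a downstream model could process channels separately before combining them; where and how that combination happens is exactly what the delayed-fusion strategy governs, and it is left out of this sketch.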