🤖 AI Summary
This paper proposes FlowMAC, a neural audio codec based on conditional flow matching (CFM), to address the trade-off between audio fidelity and compression efficiency at low bit rates (e.g., 3 kbps) in general audio coding. According to the authors, it is the first application of a CFM-based approach to general audio coding. FlowMAC jointly learns a mel spectrogram encoder, a quantizer, and a decoder. At inference time, the decoder uses an ODE solver to integrate a continuous normalizing flow and generate a high-quality mel spectrogram. In subjective evaluations, FlowMAC at 3 kbps achieves quality similar to that of state-of-the-art GAN-based and DDPM-based neural audio codecs at 6 kbps. Training is scalable, simple, and memory-efficient, and the inference pipeline is tunable: complexity can be traded against quality, which enables real-time coding on a CPU while maintaining high perceptual quality.
📝 Abstract
This paper introduces FlowMAC, a novel neural audio codec for high-quality general audio compression at low bit rates based on conditional flow matching (CFM). FlowMAC jointly learns a mel spectrogram encoder, a quantizer, and a decoder. At inference time, the decoder integrates a continuous normalizing flow via an ODE solver to generate a high-quality mel spectrogram. This is the first time that a CFM-based approach has been applied to general audio coding, enabling scalable, simple, and memory-efficient training. Our subjective evaluations show that FlowMAC at 3 kbps achieves quality similar to that of state-of-the-art GAN-based and DDPM-based neural audio codecs at double the bit rate. Moreover, FlowMAC offers a tunable inference pipeline that makes it possible to trade off complexity against quality. This enables real-time coding on a CPU while maintaining high perceptual quality.
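The decoding step described in the abstract, integrating a learned vector field from noise at t = 0 to a mel spectrogram at t = 1, can be illustrated with a fixed-step Euler solver. The sketch below is not the authors' implementation: `toy_vector_field` is a hypothetical stand-in for the trained CFM network, the 80 x 10 array shape is an arbitrary assumption, and the abstract does not say which ODE solver FlowMAC uses. The toy field has a closed-form solution, so the effect of the step count (the complexity knob) on accuracy can be measured directly.

```python
import numpy as np

def toy_vector_field(x, t, cond):
    """Hypothetical stand-in for a trained CFM network v(x, t | cond).

    It pulls x toward the conditioning signal; a real decoder would be a
    learned network conditioned on the dequantized latent.
    """
    return cond - x

def euler_decode(vector_field, x0, cond, num_steps):
    """Integrate dx/dt = v(x, t | cond) from t=0 to t=1 with fixed-step Euler."""
    x = np.array(x0, dtype=np.float64)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        x = x + dt * vector_field(x, t, cond)
    return x

rng = np.random.default_rng(0)
cond = rng.normal(size=(80, 10))  # assumed shape: 80 mel bins x 10 frames
x0 = rng.normal(size=cond.shape)  # Gaussian noise sample at t=0

# Closed-form solution of dx/dt = cond - x at t=1 (valid for the toy field only).
exact = cond + (x0 - cond) * np.exp(-1.0)

# More solver steps cost more vector-field evaluations but track the ODE better.
errors = {
    n: float(np.max(np.abs(euler_decode(toy_vector_field, x0, cond, n) - exact)))
    for n in (1, 4, 16, 64)
}
for n, err in errors.items():
    print(f"steps={n:3d}  max abs error={err:.4f}")
```

Each Euler step costs one evaluation of the vector field, so in a real codec the step count would scale decoder complexity roughly linearly. This is one plausible reading of the tunable quality-complexity trade-off mentioned in the abstract.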