🤖 AI Summary
To address the trade-off between degraded audio quality and semantic loss in low-frame-rate speech generation, this paper proposes a dual-stream end-to-end neural audio codec. The first stream employs a self-supervised encoder (Wav2Vec 2.0) to extract semantic representations, while the second stream models raw waveform details; both streams jointly optimize a shared vector-quantized codebook, significantly enhancing the semantic density and reconstruction fidelity of the first-layer codes. Crucially, this architecture enables the first differentiable co-training framework for self-supervised semantics and waveform signals under low-frame-rate constraints, breaking from conventional single-stream distillation paradigms. Experiments demonstrate that, at equivalent low frame rates, the method achieves state-of-the-art performance, outperforming Mimi, SpeechTokenizer, DAC, and EnCodec on STOI (+3.2%), MOS (+0.8), and real-time factor (+37%).
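The dual-stream idea above can be illustrated with a minimal vector-quantization sketch. This is a hedged toy model in NumPy, not the paper's implementation: all array names and shapes are hypothetical, random vectors stand in for the learned SSL and waveform encoders, and the "shared codebook" here is a single nearest-neighbor lookup table used by both streams.

```python
import numpy as np

def quantize(frames, codebook):
    """Nearest-neighbor vector quantization: map each frame vector to the
    closest codebook entry (L2 distance); return indices and chosen codes."""
    # distances has shape (num_frames, codebook_size)
    distances = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    idx = distances.argmin(axis=1)
    return idx, codebook[idx]

rng = np.random.default_rng(0)
codebook = rng.normal(size=(16, 8))   # shared VQ codebook: 16 codes of dim 8

# Stand-ins for the two streams at a low frame rate (5 frames here):
semantic = rng.normal(size=(5, 8))    # SSL-derived semantic features (stream 1)
acoustic = rng.normal(size=(5, 8))    # waveform-derived features (stream 2)

# First-layer tokens quantize the semantic stream; the acoustic stream is then
# quantized against the residual, so both streams use the same codebook --
# a rough analogue of the joint optimization described in the summary.
sem_idx, sem_codes = quantize(semantic, codebook)
residual = acoustic - sem_codes
ac_idx, ac_codes = quantize(residual, codebook)

recon = sem_codes + ac_codes          # coarse stand-in for the decoder input
print(sem_idx.shape, recon.shape)     # token indices per frame, reconstruction
```

In the real system both encoders and the codebook are trained end to end with reconstruction and adversarial losses; this sketch only shows how two streams can share one discrete code space.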
📝 Abstract
Neural audio codecs form the foundational building blocks for language model (LM)-based speech generation. Typically, there is a trade-off between frame rate and audio quality. This study introduces a low-frame-rate, semantically enhanced codec model. Existing approaches distill semantically rich self-supervised learning (SSL) representations into the first-layer codec tokens. This work proposes DualCodec, a dual-stream encoding approach that integrates SSL and waveform representations within an end-to-end codec framework. In this setting, DualCodec enhances the semantic information in the first-layer codec and enables the codec system to maintain high audio quality while operating at a low frame rate. Note that a low-frame-rate codec improves the efficiency of speech generation, since fewer tokens must be modeled per second of audio. Experimental results on audio codec and speech generation tasks confirm the effectiveness of the proposed DualCodec compared to state-of-the-art codec systems, such as Mimi Codec, SpeechTokenizer, DAC, and EnCodec. Demos and code are available at: https://dualcodec.github.io