FocalCodec-Stream: Streaming Low-Bitrate Speech Coding via Causal Distillation

📅 2025-09-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing neural audio codecs achieve strong low-bitrate reconstruction and provide useful representations for downstream tasks, but most are not streamable, which limits their use in real-time voice communication. To address this, the paper proposes FocalCodec-Stream, a streaming neural speech codec with a theoretical latency of 80 ms. The method uses multi-stage causal distillation of WavLM, builds on focal modulation, and combines single binary codebook quantization with a lightweight refiner module, preserving both semantic and acoustic information at very low bitrates (0.55–0.80 kbps). Experiments show that the approach outperforms existing streamable codecs at comparable bitrates in reconstruction quality and downstream task performance, giving a favorable trade-off between quality, latency, and efficiency.

📝 Abstract
Neural audio codecs are a fundamental component of modern generative audio pipelines. Although recent codecs achieve strong low-bitrate reconstruction and provide powerful representations for downstream tasks, most are non-streamable, limiting their use in real-time applications. We present FocalCodec-Stream, a hybrid codec based on focal modulation that compresses speech into a single binary codebook at 0.55–0.80 kbps with a theoretical latency of 80 ms. Our approach combines multi-stage causal distillation of WavLM with targeted architectural improvements, including a lightweight refiner module that enhances quality under latency constraints. Experiments show that FocalCodec-Stream outperforms existing streamable codecs at comparable bitrates, while preserving both semantic and acoustic information. The result is a favorable trade-off between reconstruction quality, downstream task performance, latency, and efficiency. Code and checkpoints will be released at https://github.com/lucadellalib/focalcodec.
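
As a rough check on the reported bitrate range, the bitrate of a single-codebook codec is the token rate multiplied by the bits per token. The sketch below assumes a token rate of 50 Hz and codebook sizes of 2^11 and 2^16 entries. Neither value is stated in the abstract; they are chosen only because they reproduce 0.55 and 0.80 kbps, and the function name `bitrate_kbps` is hypothetical.

```python
import math


def bitrate_kbps(frame_rate_hz: float, codebook_size: int) -> float:
    """Bitrate of a single-codebook codec: tokens per second times bits per token."""
    bits_per_token = math.log2(codebook_size)
    return frame_rate_hz * bits_per_token / 1000.0


# ASSUMPTION: 50 tokens per second (not given in the abstract).
ASSUMED_FRAME_RATE_HZ = 50.0

# ASSUMPTION: codebooks of 2**11 and 2**16 entries (not given in the abstract).
low = bitrate_kbps(ASSUMED_FRAME_RATE_HZ, 2**11)
high = bitrate_kbps(ASSUMED_FRAME_RATE_HZ, 2**16)

print(f"{low:.2f} kbps to {high:.2f} kbps")
```

Under these assumptions, each extra bit in the code adds 50 bit/s, which is why a single codebook keeps the bitrate in the sub-1 kbps range.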
Problem

Research questions and friction points this paper is trying to address.

Streaming low-bitrate speech coding for real-time applications
Causal distillation to achieve low-latency compression
Balancing reconstruction quality with computational efficiency
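
The streaming constraint behind the points above can be illustrated with a toy causal filter: an output at time t may depend only on inputs at or before t, so it can be emitted as soon as sample t arrives. This is a generic illustration of causality, not the FocalCodec-Stream architecture; `causal_conv1d` and the kernel values are hypothetical.

```python
def causal_conv1d(signal, kernel):
    """Causal FIR filter: y[t] = sum_k kernel[k] * x[t - k], zeros before t = 0."""
    out = []
    for t in range(len(signal)):
        acc = 0.0
        for k, weight in enumerate(kernel):
            if t - k >= 0:  # never look at samples after time t
                acc += weight * signal[t - k]
        out.append(acc)
    return out


x = [1.0, 2.0, 3.0, 4.0, 5.0]
kernel = [0.5, 0.3, 0.2]  # illustrative weights only

y = causal_conv1d(x, kernel)

# Change only the last (future) sample and filter again.
x_changed = x[:4] + [100.0]
y_changed = causal_conv1d(x_changed, kernel)

# Outputs for the first four steps are unaffected by the future sample.
print(y[:4] == y_changed[:4])
```

A non-causal layer would fail this check, because its earlier outputs would change once the future sample changed. Distilling a non-causal teacher into a causal student is one way to keep quality while satisfying this constraint.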
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hybrid codec using focal modulation
Causal distillation of WavLM
Lightweight refiner module that enhances quality under latency constraints
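
The single binary codebook mentioned in the abstract can be pictured with a sign-based quantizer, in which each latent dimension contributes one bit and the bit pattern is the token index. The abstract does not specify the quantizer, so treating it as sign-based is an assumption, and the names `binary_quantize` and `binary_dequantize` are hypothetical.

```python
def binary_quantize(latent):
    """Map a latent vector to a +/-1 code and an integer token index."""
    bits = [1 if value > 0 else 0 for value in latent]
    code = [1.0 if bit else -1.0 for bit in bits]
    index = 0
    for bit in bits:  # first dimension becomes the most significant bit
        index = (index << 1) | bit
    return code, index


def binary_dequantize(index, dim):
    """Recover the +/-1 code from a token index and the latent dimension."""
    bits = [(index >> (dim - 1 - i)) & 1 for i in range(dim)]
    return [1.0 if bit else -1.0 for bit in bits]


latent = [0.3, -1.2, 0.7, -0.1]  # toy 4-dimensional latent
code, index = binary_quantize(latent)
restored = binary_dequantize(index, len(latent))

print(code, index, restored == code)
```

With a D-dimensional latent, this scheme yields an implicit codebook of 2^D entries without storing a lookup table, so the token costs exactly D bits.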