FocalCodec-Stream: Streaming Low-Bitrate Speech Coding via Causal Distillation

📅 2025-09-19

🤖 AI Summary

Existing neural audio codecs achieve strong low-bitrate reconstruction and useful representations for downstream tasks, but most are not streamable, which limits their use in real-time voice communication. To address this, the authors propose FocalCodec-Stream, a streaming neural speech codec with a theoretical latency of 80 ms. The method distills WavLM into a causal model through multi-stage distillation, builds on focal modulation for temporal modeling, and combines a single binary codebook with a lightweight refiner module, preserving both semantic and acoustic information at very low bitrates (0.55–0.80 kbps). Experiments show that it outperforms existing streamable codecs at comparable bitrates in both reconstruction quality (PESQ, STOI) and downstream ASR performance, yielding a favorable trade-off between quality, latency, and efficiency.

📝 Abstract

Neural audio codecs are a fundamental component of modern generative audio pipelines. Although recent codecs achieve strong low-bitrate reconstruction and provide powerful representations for downstream tasks, most are non-streamable, limiting their use in real-time applications. We present FocalCodec-Stream, a hybrid codec based on focal modulation that compresses speech into a single binary codebook at 0.55–0.80 kbps with a theoretical latency of 80 ms. Our approach combines multi-stage causal distillation of WavLM with targeted architectural improvements, including a lightweight refiner module that enhances quality under latency constraints. Experiments show that FocalCodec-Stream outperforms existing streamable codecs at comparable bitrates, while preserving both semantic and acoustic information. The result is a favorable trade-off between reconstruction quality, downstream task performance, latency, and efficiency. Code and checkpoints will be released at https://github.com/lucadellalib/focalcodec.
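
To make the bitrate figures concrete, here is a minimal sketch of how a single binary codebook translates into kbps. It is an illustration only, not the authors' implementation: the 50 Hz token rate and the 11-bit and 16-bit code widths are assumptions chosen because they reproduce the 0.55 and 0.80 kbps endpoints quoted in the abstract, and the helper names `binary_code_index` and `bitrate_kbps` are hypothetical.

```python
def binary_code_index(latent):
    """Binarize a latent vector by sign and pack the bits into one integer.

    With a binary codebook every latent dimension contributes one bit,
    so a D-dimensional latent selects one of 2**D codes.
    """
    index = 0
    for value in latent:
        bit = 1 if value > 0 else 0
        index = (index << 1) | bit
    return index


def bitrate_kbps(bits_per_token, tokens_per_second):
    """Bitrate of a single-codebook codec: bits per token times token rate."""
    return bits_per_token * tokens_per_second / 1000.0


# ASSUMPTION: a 50 Hz token rate; the abstract does not state the frame rate.
TOKEN_RATE_HZ = 50

# ASSUMPTION: 11-bit and 16-bit codes, picked to match the quoted bitrate range.
for bits in (11, 16):
    print(f"{bits} bits/token, {2 ** bits} codes: "
          f"{bitrate_kbps(bits, TOKEN_RATE_HZ):.2f} kbps")

# A toy 3-dimensional latent maps to the bit pattern 1, 0, 1.
print(binary_code_index([0.3, -1.2, 0.7]))
```

Under these assumptions, each token is a single integer, so the bitrate depends only on the code width and the token rate.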

Problem

Research questions and friction points this paper is trying to address.

Streaming low-bitrate speech coding for real-time applications

Causal distillation to achieve low-latency compression

Balancing reconstruction quality with computational efficiency

Innovation

Methods, ideas, or system contributions that make the work stand out.

Hybrid codec using focal modulation

Causal distillation of WavLM

Lightweight refiner module that enhances quality under latency constraints
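
A streamable codec has to be causal: the output at time t may not depend on samples that arrive later. The following generic sketch shows that property for a one-dimensional convolution. It is not taken from the paper, and the function name `causal_conv1d` is hypothetical; it only illustrates the constraint that a causal student model must satisfy.

```python
def causal_conv1d(signal, kernel):
    """Convolve so that output[t] depends only on signal[0..t].

    kernel[0] weights the current sample and kernel[k] weights the sample
    k steps in the past. Samples before the start are treated as zero,
    which is equivalent to left-padding the input.
    """
    output = []
    for t in range(len(signal)):
        acc = 0.0
        for k, weight in enumerate(kernel):
            if t - k >= 0:
                acc += weight * signal[t - k]
        output.append(acc)
    return output


kernel = [0.5, 0.3, 0.2]
full = [1.0, 2.0, 3.0, 4.0, 5.0]

# Processing only the first three samples gives the same first three outputs
# as processing the whole signal, so audio can be encoded as it arrives.
prefix_out = causal_conv1d(full[:3], kernel)
full_out = causal_conv1d(full, kernel)
print(prefix_out == full_out[:3])
```

A non-causal layer, which also looks at future samples, would fail this prefix check, which is why an offline model cannot be used for streaming as-is.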
