🤖 AI Summary
Existing speech codecs suffer from high bitrates, imbalanced preservation of semantic and acoustic information, and the architectural complexity of multi-codebook designs, especially at ultra-low bitrates (<1 kbps). This paper introduces FocalCodec, a low-bitrate speech codec based on focal modulation that compresses speech to 0.16–0.65 kbps using only a single binary codebook. Departing from conventional multi-codebook architectures, FocalCodec leverages focal modulation to preserve both semantic content and acoustic fidelity in a single token stream, and its design handles multilingual speech and noisy conditions. Experiments show that FocalCodec delivers competitive speech resynthesis and voice conversion at lower bitrates than the current state of the art. Downstream task evaluations confirm that its discrete latent representations retain sufficient semantic and acoustic information and are well suited to generative modeling.
📝 Abstract
Large language models have revolutionized natural language processing through self-supervised pretraining on massive datasets. Inspired by this success, researchers have explored adapting these methods to speech by discretizing continuous audio into tokens using neural audio codecs. However, existing approaches face limitations, including high bitrates, the loss of either semantic or acoustic information, and the reliance on multi-codebook designs when trying to capture both, which increases architectural complexity for downstream tasks. To address these challenges, we introduce FocalCodec, an efficient low-bitrate codec based on focal modulation that utilizes a single binary codebook to compress speech at bitrates between 0.16 and 0.65 kbps. FocalCodec delivers competitive performance in speech resynthesis and voice conversion at lower bitrates than the current state-of-the-art, while effectively handling multilingual speech and noisy environments. Evaluation on downstream tasks shows that FocalCodec successfully preserves sufficient semantic and acoustic information, while also being well-suited for generative modeling. Demo samples, code, and checkpoints are available at https://lucadellalib.github.io/focalcodec-web/.
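With a single binary codebook, the bitrate is simply the token rate times the number of bits per token. The abstract states only the 0.16–0.65 kbps range; the 13-bit code length and the token rates in the sketch below are assumptions chosen because they reproduce that range, not values taken from the paper.

```python
# Hedged sketch: how a single binary codebook can yield the stated
# 0.16-0.65 kbps range. The 13-bit code length (codebook size 2**13)
# and the token rates are illustrative assumptions, not paper values.

def bitrate_kbps(token_rate_hz: float, bits_per_token: int) -> float:
    """Bitrate of a codec that emits one code per frame."""
    return token_rate_hz * bits_per_token / 1000.0

BITS = 13  # assumed bits per binary code

for rate_hz in (50.0, 25.0, 12.5):  # assumed token rates in Hz
    print(f"{rate_hz:5.1f} Hz -> {bitrate_kbps(rate_hz, BITS):.4f} kbps")
# 50 Hz gives 0.65 kbps and 12.5 Hz gives 0.1625 kbps, matching the
# endpoints of the range quoted in the abstract.
```

Under these assumptions, halving the token rate halves the bitrate while the codebook (and hence the decoder's vocabulary) stays fixed, which is one reason a single-codebook design is convenient for downstream generative modeling.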