Optimising Neural Speech Codecs for 300bps Communication using Reinforcement Learning

📅 2026-05-19

🤖 AI Summary
This work addresses the loss of speech intelligibility in neural speech codecs at ultra-low bitrates (e.g., 300 bps), where training with acoustic reconstruction losses tends to spend bits on perceptual detail and substantially increases word error rate (WER). The authors propose ClariCodec, a 300 bps neural speech codec that optimizes intelligibility directly with reinforcement learning. Quantization is modeled as a stochastic policy, and WER-driven rewards are used to fine-tune the encoder while the acoustic reconstruction pipeline stays frozen, so intelligibility improves while perceptual quality is preserved. On LibriSpeech, ClariCodec reaches 4.64% WER on test-clean before RL, already competitive with codecs operating at higher bitrates; RL fine-tuning lowers this to 3.55% on test-clean and 10.4% on test-other, a 23% relative reduction.
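As a reminder of what the metric measures, WER is conventionally the word-level edit distance between a reference transcript and a recognised transcript, divided by the number of reference words. The sketch below is a minimal illustration of that definition and of the relative-reduction arithmetic. The function names are hypothetical, and it is not the evaluation pipeline used by the authors, whose ASR system and text normalisation are not described in this entry.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # i deletions turn ref[:i] into an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(before: float, after: float) -> float:
    """Fractional drop from `before` to `after`."""
    return (before - after) / before
```

Assuming the reported relative reduction is measured on test-clean against the pre-RL model, `relative_reduction(4.64, 3.55)` evaluates to roughly 0.235, which is consistent with the stated 23%.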
📝 Abstract
In bandwidth-constrained communication such as satellite and underwater channels, speech must often be transmitted at ultra-low bitrates where intelligibility is the primary objective. At such extreme compression levels, codecs trained with acoustic reconstruction losses tend to allocate bits to perceptual detail, leading to substantial degradation in word error rate (WER). This paper proposes ClariCodec, a neural speech codec operating at 300 bits per second (bps) that reformulates quantisation as a stochastic policy, enabling reinforcement learning (RL)-based optimisation of intelligibility. Specifically, the encoder is fine-tuned using WER-driven rewards while the acoustic reconstruction pipeline remains frozen. Even without RL, ClariCodec achieves 4.64% WER on the LibriSpeech test-clean set at 300 bps, already competitive with codecs operating at higher bitrates. Further RL fine-tuning reduces WER to 3.55% on test-clean and 10.4% on test-other, corresponding to a 23% relative reduction while preserving perceptual quality.
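To make the idea of "quantisation as a stochastic policy" concrete, here is a toy sketch under several assumptions that do not come from the abstract: the policy is a softmax over negative squared distances to a frozen codebook, the encoder is a single linear map `W`, the update rule is plain REINFORCE with a mean baseline, and the reward is a stand-in (1 when a hypothetical "intelligible" code index is chosen) rather than a WER score from an ASR system. All names are illustrative; the paper's actual parameterisation and RL algorithm may differ.

```python
import numpy as np


def policy_probs(z, codebook, tau=1.0):
    """Categorical policy over code indices: softmax(-||z - c_k||^2 / tau)."""
    logits = -np.sum((codebook - z) ** 2, axis=1) / tau
    logits -= logits.max()  # numerical stability
    p = np.exp(logits)
    return p / p.sum()


def reinforce_step(W, x, codebook, reward_fn, rng, tau=1.0, lr=0.1, n_samples=16):
    """One REINFORCE update of the encoder weights W; the codebook stays frozen."""
    z = W @ x
    probs = policy_probs(z, codebook, tau)
    ks = rng.choice(len(codebook), size=n_samples, p=probs)  # sample code indices
    rewards = np.array([reward_fn(k) for k in ks], dtype=float)
    advantages = rewards - rewards.mean()  # mean baseline reduces variance
    # For this policy, grad_z log pi(k) = (2 / tau) * (c_k - E_pi[c]).
    mean_code = probs @ codebook
    grad_z = np.zeros_like(z)
    for k, adv in zip(ks, advantages):
        grad_z += adv * (2.0 / tau) * (codebook[k] - mean_code)
    grad_z /= n_samples
    # Chain rule through z = W @ x, then gradient ascent on expected reward.
    return W + lr * np.outer(grad_z, x)


def run_toy_demo(seed=0, steps=500, target=3):
    """Return the probability of the rewarded code before and after training."""
    rng = np.random.default_rng(seed)
    codebook = rng.normal(size=(8, 4))
    codebook /= np.linalg.norm(codebook, axis=1, keepdims=True)  # unit-norm codes
    W = rng.normal(scale=0.1, size=(4, 3))
    x = np.array([1.0, -0.5, 0.25])  # one fixed toy input frame

    def reward_fn(k):
        # Stand-in reward; a real setup would score decoded audio, e.g. with -WER.
        return 1.0 if k == target else 0.0

    p_before = policy_probs(W @ x, codebook)[target]
    for _ in range(steps):
        W = reinforce_step(W, x, codebook, reward_fn, rng)
    p_after = policy_probs(W @ x, codebook)[target]
    return p_before, p_after
```

Calling `run_toy_demo()` returns the probability assigned to the rewarded code before and after training. In this toy setting the encoder learns to place its output nearer the rewarded codebook entry, while the codebook itself is never updated, loosely mirroring the frozen reconstruction pipeline described above.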
Problem

Research questions and friction points this paper is trying to address.

ultra-low bitrate
speech intelligibility
neural speech codec
word error rate
bandwidth-constrained communication
Innovation

Methods, ideas, or system contributions that make the work stand out.

neural speech codec
reinforcement learning
ultra-low bitrate
word error rate
stochastic quantization