🤖 AI Summary
Speech codecs must balance acoustic fidelity and semantic preservation, yet existing approaches often fail to achieve both simultaneously. This paper proposes a "semantics-first" low-bitrate speech coding framework: it freezes a simplified Whisper encoder, leveraging its inherent text-alignment capability to enable end-to-end speech discretization without external supervision while jointly optimizing semantic and acoustic objectives. Departing from conventional acoustic-reconstruction paradigms, the method treats semantic consistency as the primary optimization driver and employs an efficient quantization strategy to yield compact discrete representations. Experiments demonstrate that, at equal bitrates, the approach significantly outperforms Mimi Codec and SpeechTokenizer in semantic accuracy, speech intelligibility, and synthesis clarity. To the best of our knowledge, this is the first work to validate the feasibility and effectiveness of freezing a pre-trained ASR model as a semantic anchor for speech coding.
📄 Abstract
Speech codecs serve as bridges between continuous speech signals and large language models, yet face an inherent conflict between acoustic fidelity and semantic preservation. To mitigate this conflict, prevailing methods augment acoustic codecs with complex semantic supervision. We explore the opposite direction: a semantic-first approach that starts from a semantically capable model and adapts it for high-fidelity acoustic reconstruction. Through empirical analysis, we discover that targeted architectural simplification can unlock the acoustic modeling potential of Whisper, a text-aligned Automatic Speech Recognition (ASR) model. Based on this finding, we propose SimWhisper-Codec, a novel codec that balances semantic and acoustic preservation by leveraging a frozen, simplified Whisper encoder without requiring external supervision. Experimental results demonstrate that SimWhisper-Codec achieves superior performance in both semantic preservation and acoustic quality compared to semantically supervised codecs such as Mimi Codec and SpeechTokenizer at similar bitrates, validating the effectiveness of our semantic-first approach. Code is available at https://github.com/ZhangXinWhut/SimWhisper-Codec.
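The discretization step described above (frozen encoder features mapped to compact discrete tokens) can be illustrated with a minimal nearest-neighbor vector quantizer. This is a generic sketch with toy shapes and random stand-ins for Whisper features, not the paper's actual quantization strategy; the function name and dimensions are illustrative.

```python
import numpy as np

def quantize(features, codebook):
    """Map each frame-level feature vector to its nearest codebook entry.

    features: (T, D) array, e.g. outputs of a frozen speech encoder.
    codebook: (K, D) array of code vectors.
    Returns (indices, quantized): discrete token per frame, and the
    dequantized features a decoder would reconstruct audio from.
    """
    # Squared L2 distance between every frame and every code vector.
    d2 = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    indices = d2.argmin(axis=1)          # one discrete token per frame
    return indices, codebook[indices]

# Toy usage: random stand-ins for encoder features and a learned codebook.
rng = np.random.default_rng(0)
feats = rng.normal(size=(100, 16))   # 100 frames, 16-dim features
codes = rng.normal(size=(8, 16))     # 8-entry codebook -> 3 bits per frame
idx, q = quantize(feats, codes)
```

At a fixed frame rate, the bitrate of such a tokenizer is simply `frames_per_second * log2(K)` per codebook, which is why codebook size and frame rate are the main levers a low-bitrate codec can trade off.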