Speaking Clearly: A Simplified Whisper-Based Codec for Low-Bitrate Speech Coding

📅 2025-10-23
🤖 AI Summary
Speech codecs must balance acoustic fidelity and semantic preservation, yet existing approaches often fail to achieve both at once. This paper proposes a "semantics-first" low-bitrate speech coding framework: it freezes and simplifies the Whisper encoder, relying on the encoder's inherent text alignment to discretize speech end to end without external supervision while preserving both semantic and acoustic information. Departing from conventional acoustic-reconstruction paradigms, the method treats semantic consistency as the primary optimization driver and uses an efficient quantization strategy to produce compact discrete representations. Experiments show that, at comparable bitrates, the approach outperforms Mimi Codec and SpeechTokenizer in semantic accuracy, speech intelligibility, and synthesis clarity. To the best of the authors' knowledge, this is the first work to validate the feasibility and effectiveness of using a frozen pre-trained ASR model as a semantic anchor for speech coding.
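
To make "compact discrete representations" and "comparable bitrates" concrete, the following is a minimal sketch of how a token-based codec turns feature frames into codebook indices and how the resulting bitrate is computed. It is not the SimWhisper-Codec quantizer, which this summary does not describe. The function names, the toy 2-D codebook, the 50 Hz frame rate, and the 1024-entry codebook size are all hypothetical values chosen for illustration.

```python
import math


def nearest_code(frame, codebook):
    """Return the index of the codebook vector closest to `frame` (squared L2)."""
    best_index, best_dist = 0, float("inf")
    for index, code in enumerate(codebook):
        dist = sum((x - c) ** 2 for x, c in zip(frame, code))
        if dist < best_dist:
            best_index, best_dist = index, dist
    return best_index


def quantize(frames, codebook):
    """Map each continuous frame to one discrete token (a codebook index)."""
    return [nearest_code(frame, codebook) for frame in frames]


def bitrate_bps(frame_rate_hz, codebook_size, num_codebooks=1):
    """Bits per second = frames/s * codebooks/frame * bits per index."""
    return frame_rate_hz * num_codebooks * math.log2(codebook_size)


# Hypothetical toy data: a 4-entry codebook over 2-D features.
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
frames = [(0.1, -0.2), (0.9, 0.8), (0.2, 1.1)]

tokens = quantize(frames, codebook)

# Hypothetical configuration: 50 frames/s, one codebook of 1024 entries.
example_bitrate = bitrate_bps(frame_rate_hz=50, codebook_size=1024)
```

Two codecs are compared "at the same bitrate" when this product matches, even if one uses fewer frames per second with more codebooks and the other does the reverse.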

๐Ÿ“ Abstract
Speech codecs serve as bridges between continuous speech signals and large language models, yet face an inherent conflict between acoustic fidelity and semantic preservation. To mitigate this conflict, prevailing methods augment acoustic codecs with complex semantic supervision. We explore the opposite direction: a semantic-first approach that starts from a semantically capable model and adapts it for high-fidelity acoustic reconstruction. Through empirical analysis, we discover that targeted architectural simplification can unlock the acoustic modeling potential of Whisper, a text-aligned Automatic Speech Recognition (ASR) model. Based on this finding, we propose SimWhisper-Codec, a novel codec that balances semantic and acoustic preservation by leveraging a frozen, simplified Whisper encoder without requiring external supervision. Experimental results demonstrate that SimWhisper-Codec achieves superior performance in both semantic preservation and acoustic quality compared to semantically supervised codecs such as Mimi Codec and SpeechTokenizer at similar bitrates, validating the effectiveness of our semantic-first approach. Code is available at https://github.com/ZhangXinWhut/SimWhisper-Codec.
Problem

Research questions and friction points this paper is trying to address.

Resolving the conflict between acoustic fidelity and semantic preservation in speech codecs
Developing a semantic-first approach using a simplified Whisper model
Achieving superior performance without external semantic supervision
Innovation

Methods, ideas, or system contributions that make the work stand out.

Simplified Whisper encoder for acoustic modeling
Semantic-first approach without external supervision
Balances semantic and acoustic preservation efficiently