🤖 AI Summary
Speech codecs must balance acoustic fidelity and semantic preservation, yet existing approaches often fail to achieve both simultaneously. This paper proposes a "semantics-first" low-bitrate speech coding framework: it freezes a simplified Whisper encoder, leveraging its inherent text-alignment capability to enable end-to-end speech discretization without external supervision while jointly optimizing semantic and acoustic objectives. Departing from conventional acoustic-reconstruction paradigms, the method treats semantic consistency as the primary optimization driver and employs an efficient quantization strategy to yield compact discrete representations. Experiments demonstrate that, at equal bitrates, the approach significantly outperforms Mimi Codec and SpeechTokenizer in semantic accuracy, speech intelligibility, and synthesis clarity. To the best of our knowledge, this is the first work to validate the feasibility and effectiveness of freezing a pre-trained ASR model as a semantic anchor for speech coding.
📄 Abstract
Speech codecs serve as bridges between continuous speech signals and large language models, yet face an inherent conflict between acoustic fidelity and semantic preservation. To mitigate this conflict, prevailing methods augment acoustic codecs with complex semantic supervision. We explore the opposite direction: a semantic-first approach that starts from a semantically capable model and adapts it for high-fidelity acoustic reconstruction. Through empirical analysis, we discover that targeted architectural simplification can unlock the acoustic modeling potential of Whisper, a text-aligned Automatic Speech Recognition (ASR) model. Based on this finding, we propose SimWhisper-Codec, a novel codec that balances semantic and acoustic preservation by leveraging a frozen, simplified Whisper encoder without requiring external supervision. Experimental results demonstrate that SimWhisper-Codec achieves superior performance in both semantic preservation and acoustic quality compared to semantically supervised codecs such as Mimi Codec and SpeechTokenizer at similar bitrates, validating the effectiveness of our semantic-first approach. Code is available at https://github.com/ZhangXinWhut/SimWhisper-Codec.
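The discretization step described above (frozen encoder features mapped to compact discrete tokens) can be illustrated with a minimal nearest-neighbor vector quantizer. This is a generic sketch with toy shapes and random stand-ins for Whisper features, not the paper's actual quantization strategy; the function name and dimensions are illustrative.

```python
import numpy as np

def quantize(features, codebook):
    """Map each frame-level feature vector to its nearest codebook entry.

    features: (T, D) array, e.g. outputs of a frozen speech encoder.
    codebook: (K, D) array of code vectors.
    Returns (indices, quantized): discrete token per frame, and the
    dequantized features a decoder would reconstruct audio from.
    """
    # Squared L2 distance between every frame and every code vector.
    d2 = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    indices = d2.argmin(axis=1)          # one discrete token per frame
    return indices, codebook[indices]

# Toy usage: random stand-ins for encoder features and a learned codebook.
rng = np.random.default_rng(0)
feats = rng.normal(size=(100, 16))   # 100 frames, 16-dim features
codes = rng.normal(size=(8, 16))     # 8-entry codebook -> 3 bits per frame
idx, q = quantize(feats, codes)
```

At a fixed frame rate, the bitrate of such a tokenizer is simply `frames_per_second * log2(K)` per codebook, which is why codebook size and frame rate are the main levers a low-bitrate codec can trade off.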