Effective and Efficient Mixed Precision Quantization of Speech Foundation Models

📅 2025-01-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address accuracy degradation and efficiency bottlenecks when compressing speech foundation models (e.g., wav2vec 2.0, HuBERT), this paper proposes an end-to-end mixed-precision quantization method. Unlike conventional two-stage, decoupled approaches, it unifies bit-width assignment and parameter quantization within a single-stage differentiable optimization framework: layer-wise precision configurations are learned via gradient-driven bit-width optimization, so precision allocation and quantization occur simultaneously during training. On HuBERT-large, the method achieves an 8.6× lossless compression ratio, up to 1.9× higher than the two-stage baseline, with no statistically significant increase in word error rate (WER), while reducing compression time by 1.5–1.9×. These results demonstrate significant improvements in compression efficiency and deployment feasibility.
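The single-stage idea described above can be illustrated with a minimal sketch: each layer keeps a small set of candidate bit widths, a softmax over learnable logits softly selects among the quantized copies during training, and the argmax precision is kept at deployment. This is a hypothetical simplification for illustration, not the paper's actual implementation; `fake_quant`, `mixed_precision_forward`, and the candidate bit widths are all assumed names and choices.

```python
import numpy as np

def fake_quant(w, bits):
    # Symmetric uniform quantization of w at the given bit width:
    # map to the integer grid [-qmax, qmax], then back to float.
    qmax = 2 ** (bits - 1) - 1
    scale = np.max(np.abs(w)) / qmax
    return np.round(w / scale) * scale

def mixed_precision_forward(w, logits, candidate_bits=(2, 4, 8)):
    # Soft bit-width selection: a softmax over the candidate precisions
    # weights the per-precision quantized copies of the layer weights.
    # In a full training loop the logits would be optimized jointly with
    # w (with a straight-through estimator through the rounding), which
    # is what makes the precision assignment differentiable.
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return sum(p * fake_quant(w, b) for p, b in zip(probs, candidate_bits))

# Usage: logits strongly favoring the 8-bit candidate make the mixture
# collapse to the 8-bit quantized weights.
w = np.linspace(-1.0, 1.0, 9)
out = mixed_precision_forward(w, np.array([-10.0, -10.0, 10.0]))
```

At convergence, each layer's learned distribution concentrates on one candidate, yielding the layer-wise (e.g., 3.5-bit average) mixed-precision configuration in one pass rather than a separate precision-search stage.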

📝 Abstract
This paper presents a novel mixed-precision quantization approach for speech foundation models that tightly integrates mixed-precision learning and quantized model parameter estimation into a single model compression stage. Experiments conducted on the LibriSpeech dataset with fine-tuned wav2vec2.0-base and HuBERT-large models suggest the resulting mixed-precision quantized models increase the lossless compression ratio by factors of up to 1.7x and 1.9x over the respective uniform-precision and two-stage mixed-precision quantized baselines, which perform precision learning and model parameter quantization in separate, disjointed stages, while incurring no statistically significant word error rate (WER) increase over the 32-bit full-precision models. The system compression time of the wav2vec2.0-base and HuBERT-large models is reduced by up to 1.9 and 1.5 times over the two-stage mixed-precision baselines, while both produce lower WERs. The best-performing 3.5-bit mixed-precision quantized HuBERT-large model produces a lossless compression ratio of 8.6x over the 32-bit full-precision system.
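As a quick sanity check on the headline number, the raw ratio implied by a 3.5-bit average precision against 32-bit storage can be computed directly. That the reported lossless ratio (8.6x) sits below this bound is presumably due to quantization metadata such as per-layer scales counting toward model size; that explanation is an assumption, not stated in the abstract.

```python
# Upper bound on compression from average bit width alone:
# 32-bit full precision vs. a 3.5-bit average mixed-precision model.
avg_bits = 3.5
raw_ratio = 32 / avg_bits
print(round(raw_ratio, 1))  # → 9.1, vs. the reported 8.6x
```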
Problem

Research questions and friction points this paper is trying to address.

Speech Recognition
Model Compression
Storage Efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Model Compression
Speech Recognition
Efficiency Improvement