Edge-ASR: Towards Low-Bit Quantization of Automatic Speech Recognition Models

📅 2025-07-10
📈 Citations: 0 (influential: 0)
🤖 AI Summary
Edge-device deployment of automatic speech recognition (ASR) models is constrained by memory, compute capacity, and power consumption. To address this, we systematically evaluate eight state-of-the-art post-training quantization (PTQ) methods on leading edge-optimized ASR model families, including Whisper and Moonshine, and assess their accuracy-efficiency trade-offs. This work presents the first multi-dataset, multi-model empirical validation of 3-bit quantization feasibility for ASR. We propose a unified calibration and evaluation framework, extended from the LLM compression toolkit, that integrates weight quantization, activation quantization, sub-3-bit inference support, and memory I/O and bit-operation profiling. Experiments show that advanced PTQ methods hold average word error rate (WER) degradation under 0.5% at 3-bit precision while reducing model size and computational cost by over 65%, substantially improving energy efficiency on edge devices. Our code and benchmark suite are publicly released.
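As a point of reference for what these PTQ methods build on, the sketch below shows the simplest baseline: symmetric per-channel round-to-nearest (RTN) weight quantization at 3 bits, with the dequantized ("fake-quantized") weights used to simulate accuracy. This is a generic illustration, not one of the eight benchmarked algorithms, and the function names are our own.

```python
import numpy as np

def fake_quant_weights(w: np.ndarray, n_bits: int = 3) -> np.ndarray:
    """Symmetric per-output-channel round-to-nearest weight quantization.

    w: (out_features, in_features) weight matrix.
    Returns dequantized weights; running the model with these simulates
    quantized accuracy before any integer kernel exists.
    """
    qmax = 2 ** (n_bits - 1) - 1                       # 3-bit signed -> grid [-4, 3]
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)           # guard all-zero channels
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)  # snap to integer grid
    return (q * scale).astype(w.dtype)                 # dequantize for simulation

rng = np.random.default_rng(0)
w = rng.normal(size=(8, 16)).astype(np.float32)
print("mean |error|:", np.abs(w - fake_quant_weights(w)).mean())
```

The advanced methods benchmarked in the paper improve on this baseline mainly through smarter calibration (choosing scales, clipping ranges, or rounding decisions from held-out data) rather than through a different storage format.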

📝 Abstract
Recent advances in Automatic Speech Recognition (ASR) have demonstrated remarkable accuracy and robustness in diverse audio applications, such as live transcription and voice command processing. However, deploying these models on resource-constrained edge devices (e.g., IoT devices, wearables) still presents substantial challenges due to strict limits on memory, compute, and power. Quantization, particularly Post-Training Quantization (PTQ), offers an effective way to reduce model size and inference cost without retraining. Despite its importance, the performance implications of various advanced quantization methods and bit-width configurations on ASR models remain unclear. In this work, we present a comprehensive benchmark of eight state-of-the-art (SOTA) PTQ methods applied to two leading edge-ASR model families, Whisper and Moonshine. We systematically evaluate model performance (i.e., accuracy, memory I/O, and bit operations) across seven diverse datasets from the open ASR leaderboard, analyzing the impact of quantization and various configurations on both weights and activations. Built on an extension of the LLM compression toolkit, our framework integrates edge-ASR models, diverse advanced quantization algorithms, a unified calibration and evaluation data pipeline, and detailed analysis tools. Our results characterize the trade-offs between efficiency and accuracy, demonstrating that even 3-bit quantization can succeed on high-capacity models when advanced PTQ techniques are used. These findings provide valuable insights for optimizing ASR models on low-power, always-on edge devices.
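Accuracy throughout is reported as word error rate (WER): the word-level edit distance between hypothesis and reference transcripts, normalized by reference length. The abstract does not name a scoring tool; as an assumption, the widely used jiwer library computes it directly:

```python
import jiwer

# Hypothetical transcripts for illustration, not data from the paper.
references = ["turn on the living room lights",
              "what is the weather tomorrow"]
hypotheses = ["turn on the living room light",
              "what is whether tomorrow"]

# WER = (substitutions + deletions + insertions) / reference word count
wer = jiwer.wer(references, hypotheses)
print(f"WER: {wer:.3f}")
```

"WER degradation" in the results is then simply the quantized model's WER minus the floating-point model's WER on the same test set.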
Problem

Research questions and friction points this paper is trying to address.

Quantizing ASR models efficiently enough for memory-, compute-, and power-limited edge devices
Evaluating how state-of-the-art PTQ methods behave on the Whisper and Moonshine model families
Balancing accuracy against model size and compute cost under low-bit quantization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Low-bit (down to 3-bit) quantization of edge-ASR models
Comprehensive benchmark of eight SOTA PTQ methods across seven datasets from the open ASR leaderboard
Demonstration that 3-bit quantization remains accurate with advanced PTQ, trading under 0.5% average WER degradation for over 65% savings in size and compute (see the efficiency sketch below)
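On the efficiency side, the benchmark profiles memory I/O and bit operations. The sketch below uses a common accounting assumption (each multiply-accumulate costs roughly w_bits * a_bits bit operations, and weight storage scales linearly with bit-width); the paper's exact profiling may differ, and the MAC and parameter counts here are illustrative, not measurements from the paper.

```python
def bops(macs: int, w_bits: int, a_bits: int) -> int:
    """Bit-operations proxy: one MAC at mixed precision costs ~w_bits * a_bits."""
    return macs * w_bits * a_bits

def weight_mb(n_params: int, w_bits: int) -> float:
    """Weight storage in MB, ignoring per-channel scale/zero-point overhead."""
    return n_params * w_bits / 8 / 1e6

macs, params = 2_000_000_000, 40_000_000   # illustrative model size
base = bops(macs, 16, 16)                  # FP16 baseline
for wb, ab in [(16, 16), (8, 8), (4, 8), (3, 8)]:
    b = bops(macs, wb, ab)
    print(f"W{wb}A{ab}: {weight_mb(params, wb):6.1f} MB weights, "
          f"BOPs reduced {100 * (1 - b / base):.0f}% vs FP16")
```

Under this accounting, moving weights from 16 to 3 bits alone cuts weight storage by more than 80%, which is consistent in spirit with the >65% size and compute reduction the paper reports.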
Authors

Chen Feng (Qualcomm AI Research)
Yicheng Lin (Qualcomm AI Research)
Shaojie Zhuo (Qualcomm)
Chenzheng Su (Qualcomm AI Research)
Ramchalam Kinattinkara Ramakrishnan (Qualcomm AI Research, Toronto)
Zhaocong Yuan (Qualcomm AI Research)
Xiaopeng Zhang (Qualcomm AI Research)