Edge-ASR: Towards Low-Bit Quantization of Automatic Speech Recognition Models

📅 2025-07-10
📈 Citations: 0 (influential: 0)
🤖 AI Summary
Edge-device deployment of automatic speech recognition (ASR) models is constrained by memory, compute capacity, and power consumption. To address this, we systematically evaluate eight state-of-the-art post-training quantization (PTQ) methods on leading edge-optimized ASR model families, including Whisper and Moonshine, and assess their accuracy-efficiency trade-offs. This work presents the first multi-dataset, multi-model empirical validation of 3-bit quantization feasibility for ASR. We propose a unified calibration and evaluation framework, extended from the LLM compression toolkit, that integrates weight quantization, activation quantization, sub-3-bit inference support, and memory I/O and bit-operation profiling. Experiments show that advanced PTQ methods hold average word error rate (WER) degradation under 0.5% at 3-bit precision while reducing model size and computational cost by over 65%, substantially improving energy efficiency on edge devices. Our code and benchmark suite are publicly released.
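As a point of reference for what these PTQ methods build on, the sketch below shows the simplest baseline: symmetric per-channel round-to-nearest (RTN) weight quantization at 3 bits, with the dequantized ("fake-quantized") weights used to simulate accuracy. This is a generic illustration, not one of the eight benchmarked algorithms, and the function names are our own.

```python
import numpy as np

def fake_quant_weights(w: np.ndarray, n_bits: int = 3) -> np.ndarray:
    """Symmetric per-output-channel round-to-nearest weight quantization.

    w: (out_features, in_features) weight matrix.
    Returns dequantized weights; running the model with these simulates
    quantized accuracy before any integer kernel exists.
    """
    qmax = 2 ** (n_bits - 1) - 1                       # 3-bit signed -> grid [-4, 3]
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)           # guard all-zero channels
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)  # snap to integer grid
    return (q * scale).astype(w.dtype)                 # dequantize for simulation

rng = np.random.default_rng(0)
w = rng.normal(size=(8, 16)).astype(np.float32)
print("mean |error|:", np.abs(w - fake_quant_weights(w)).mean())
```

The advanced methods benchmarked in the paper improve on this baseline mainly through smarter calibration (choosing scales, clipping ranges, or rounding decisions from held-out data) rather than through a different storage format.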

📝 Abstract
Recent advances in Automatic Speech Recognition (ASR) have demonstrated remarkable accuracy and robustness in diverse audio applications, such as live transcription and voice command processing. However, deploying these models on resource-constrained edge devices (e.g., IoT devices, wearables) still presents substantial challenges due to strict limits on memory, compute, and power. Quantization, particularly Post-Training Quantization (PTQ), offers an effective way to reduce model size and inference cost without retraining. Despite its importance, the performance implications of various advanced quantization methods and bit-width configurations on ASR models remain unclear. In this work, we present a comprehensive benchmark of eight state-of-the-art (SOTA) PTQ methods applied to two leading edge-ASR model families, Whisper and Moonshine. We systematically evaluate model performance (i.e., accuracy, memory I/O, and bit operations) across seven diverse datasets from the open ASR leaderboard, analyzing the impact of quantization and various configurations on both weights and activations. Built on an extension of the LLM compression toolkit, our framework integrates edge-ASR models, diverse advanced quantization algorithms, a unified calibration and evaluation data pipeline, and detailed analysis tools. Our results characterize the trade-offs between efficiency and accuracy, demonstrating that even 3-bit quantization can succeed on high-capacity models when advanced PTQ techniques are used. These findings provide valuable insights for optimizing ASR models on low-power, always-on edge devices.
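Accuracy throughout is reported as word error rate (WER): the word-level edit distance between hypothesis and reference transcripts, normalized by reference length. The abstract does not name a scoring tool; as an assumption, the widely used jiwer library computes it directly:

```python
import jiwer

# Hypothetical transcripts for illustration, not data from the paper.
references = ["turn on the living room lights",
              "what is the weather tomorrow"]
hypotheses = ["turn on the living room light",
              "what is whether tomorrow"]

# WER = (substitutions + deletions + insertions) / reference word count
wer = jiwer.wer(references, hypotheses)
print(f"WER: {wer:.3f}")
```

"WER degradation" in the results is then simply the quantized model's WER minus the floating-point model's WER on the same test set.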
Problem

Research questions and friction points this paper is trying to address.

Quantizing ASR models efficiently enough for memory-, compute-, and power-limited edge devices
Evaluating how state-of-the-art PTQ methods behave on the Whisper and Moonshine model families
Balancing accuracy against model size and compute cost under low-bit quantization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Low-bit (down to 3-bit) quantization of edge-ASR models
Comprehensive benchmark of eight SOTA PTQ methods across seven datasets from the open ASR leaderboard
Demonstration that 3-bit quantization remains accurate with advanced PTQ, trading under 0.5% average WER degradation for over 65% savings in size and compute (see the efficiency sketch below)
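On the efficiency side, the benchmark profiles memory I/O and bit operations. The sketch below uses a common accounting assumption (each multiply-accumulate costs roughly w_bits * a_bits bit operations, and weight storage scales linearly with bit-width); the paper's exact profiling may differ, and the MAC and parameter counts here are illustrative, not measurements from the paper.

```python
def bops(macs: int, w_bits: int, a_bits: int) -> int:
    """Bit-operations proxy: one MAC at mixed precision costs ~w_bits * a_bits."""
    return macs * w_bits * a_bits

def weight_mb(n_params: int, w_bits: int) -> float:
    """Weight storage in MB, ignoring per-channel scale/zero-point overhead."""
    return n_params * w_bits / 8 / 1e6

macs, params = 2_000_000_000, 40_000_000   # illustrative model size
base = bops(macs, 16, 16)                  # FP16 baseline
for wb, ab in [(16, 16), (8, 8), (4, 8), (3, 8)]:
    b = bops(macs, wb, ab)
    print(f"W{wb}A{ab}: {weight_mb(params, wb):6.1f} MB weights, "
          f"BOPs reduced {100 * (1 - b / base):.0f}% vs FP16")
```

Under this accounting, moving weights from 16 to 3 bits alone cuts weight storage by more than 80%, which is consistent in spirit with the >65% size and compute reduction the paper reports.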
Authors

Chen Feng (Qualcomm AI Research)
Yicheng Lin (Qualcomm AI Research)
Shaojie Zhuo (Qualcomm)
Chenzheng Su (Qualcomm AI Research)
Ramchalam Kinattinkara Ramakrishnan (Qualcomm AI Research, Toronto)
Zhaocong Yuan (Qualcomm AI Research)
Xiaopeng Zhang (Qualcomm AI Research)