🤖 AI Summary
To address the high energy consumption of edge-device streaming automatic speech recognition (ASR), which degrades user experience, this paper uncovers an intrinsic relationship between the energy cost of weight parameters and runtime factors, namely invocation frequency and memory allocation, and proposes the first component-level energy-sensitivity modeling framework. Building on this, the authors establish an energy-accuracy co-optimization paradigm that incorporates dynamic weight configuration and a lightweight streaming ASR architecture. The core innovation is the deep integration of hardware-aware energy-efficiency analysis into model architecture design, enabling fine-grained and transferable energy optimization. Experiments demonstrate that the approach reduces device power consumption by up to 47% relative to state-of-the-art methods, incurs less than a 0.2 percentage-point increase in word error rate (WER), and improves real-time performance by 1.8×, significantly advancing the practical deployment of low-power, high-accuracy, low-latency edge ASR.
📝 Abstract
Power consumption plays a crucial role in on-device streaming speech recognition, significantly influencing the user experience. This study explores how the configuration of weight parameters in speech recognition models affects their overall energy efficiency. We find that the influence of these parameters on power consumption varies with factors such as invocation frequency and memory allocation. Leveraging these insights, we propose design principles that enhance on-device speech recognition models by reducing power consumption with minimal impact on accuracy. Our approach, which adjusts model components based on their specific energy sensitivities, achieves up to 47% lower energy usage than leading methods while preserving comparable accuracy and improving real-time performance.
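The abstract's key observation is that a component's contribution to power draw depends on how often it is invoked and how much memory its weights occupy. A minimal sketch of that idea, with a deliberately simplified cost model and made-up component names and numbers (none of these are from the paper), might rank components by an estimated energy rate so the most sensitive ones become candidates for shrinking or reconfiguration:

```python
# Toy component-level energy-sensitivity sketch (illustrative only;
# the cost model, component names, and constants are assumptions,
# not the paper's actual formulation).
from dataclasses import dataclass


@dataclass
class Component:
    name: str
    params_mb: float      # memory allocated to this component's weights (MB)
    calls_per_sec: float  # invocation frequency during streaming inference


def energy_sensitivity(c: Component,
                       joules_per_mb_access: float = 0.05) -> float:
    """Estimated energy rate (J/s): each invocation touches the
    component's weights, so cost scales with frequency * memory."""
    return c.calls_per_sec * c.params_mb * joules_per_mb_access


def rank_by_sensitivity(components: list[Component]) -> list[Component]:
    """Most energy-sensitive components first: the natural targets
    for an energy-accuracy trade-off."""
    return sorted(components, key=energy_sensitivity, reverse=True)


components = [
    Component("encoder_frontend", params_mb=12.0, calls_per_sec=50.0),
    Component("decoder", params_mb=30.0, calls_per_sec=5.0),
    Component("joint_network", params_mb=2.0, calls_per_sec=200.0),
]

for c in rank_by_sensitivity(components):
    print(f"{c.name}: {energy_sensitivity(c):.1f} J/s")
```

Note how the ranking differs from a pure parameter-count ranking: the large but rarely invoked decoder scores lower than the small, frequently invoked joint network, which is the kind of runtime-aware distinction the paper's sensitivity analysis is about.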