Pushing the Limits of On-Device Streaming ASR: A Compact, High-Accuracy English Model for Low-Latency Inference

📅 2026-04-15

🤖 AI Summary

This work targets efficient streaming automatic speech recognition (ASR) on GPU-free edge devices by re-implementing the NVIDIA Nemotron Speech Streaming inference pipeline in ONNX Runtime and evaluating post-training quantization on CPU. Combining quantization strategies (importance-weighted k-quant, mixed precision, and round-to-nearest) with graph-level operator fusion reduces the model from 2.47 GB to as little as 0.67 GB while keeping word error rate (WER) within 1% absolute of the full-precision PyTorch baseline. The recommended int4 k-quant configuration reaches 8.20% average streaming WER across eight standard benchmarks and runs faster than real time on CPU with 0.56 s algorithmic latency, establishing a new quality-efficiency Pareto point for on-device streaming ASR.
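For readers unfamiliar with the metric, the following is a minimal sketch of how WER is commonly computed: word-level edit distance divided by the number of reference words. The function names, the lowercase-and-split normalization, and the unweighted mean across benchmarks are assumptions made for illustration; the paper's actual text normalization and averaging scheme are not stated here.

```python
from typing import Sequence


def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    # Assumption: simple lowercase + whitespace tokenization.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the processed reference
    # prefix and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


def average_wer(per_benchmark_wer: Sequence[float]) -> float:
    """Unweighted mean over benchmarks (assumed averaging scheme)."""
    return sum(per_benchmark_wer) / len(per_benchmark_wer)


if __name__ == "__main__":
    # One deleted word out of six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference.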

📝 Abstract

Deploying high-quality automatic speech recognition (ASR) on edge devices requires models that jointly optimize accuracy, latency, and memory footprint while operating entirely on CPU without GPU acceleration. We conduct a systematic empirical study of state-of-the-art ASR architectures, encompassing encoder-decoder, transducer, and LLM-based paradigms, evaluated across batch, chunked, and streaming inference modes. Through a comprehensive benchmark of over 50 configurations spanning OpenAI Whisper, NVIDIA Nemotron, Parakeet TDT, Canary, Conformer Transducer, and Qwen3-ASR, we identify NVIDIA's Nemotron Speech Streaming as the strongest candidate for real-time English streaming on resource-constrained hardware. We then re-implement the complete streaming inference pipeline in ONNX Runtime and conduct a controlled evaluation of multiple post-training quantization strategies, including importance-weighted k-quant, mixed-precision schemes, and round-to-nearest quantization, combined with graph-level operator fusion. These optimizations reduce the model from 2.47 GB to as little as 0.67 GB while maintaining word error rate (WER) within 1% absolute of the full-precision PyTorch baseline. Our recommended configuration, the int4 k-quant variant, achieves 8.20% average streaming WER across eight standard benchmarks, running comfortably faster than real-time on CPU with 0.56 s algorithmic latency, establishing a new quality-efficiency Pareto point for on-device streaming ASR.
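Of the quantization strategies named in the abstract, round-to-nearest is the simplest to illustrate. The sketch below shows symmetric, block-wise round-to-nearest quantization to the int4 range in NumPy. It is not the paper's implementation: the block size of 32, the symmetric scaling, and the function names are assumptions, and the importance-weighted k-quant variant (which chooses scales using weight importance) is not shown.

```python
import numpy as np


def quantize_rtn_int4(weights: np.ndarray, block_size: int = 32):
    """Symmetric block-wise round-to-nearest quantization to the int4 range."""
    if weights.size % block_size != 0:
        raise ValueError("weight count must be divisible by block_size")
    blocks = weights.astype(np.float32).reshape(-1, block_size)

    # One scale per block maps the largest magnitude onto the value 7.
    max_abs = np.abs(blocks).max(axis=1, keepdims=True)
    scales = np.where(max_abs > 0, max_abs / 7.0, 1.0).astype(np.float32)

    # Round to the nearest integer and clamp to the signed 4-bit range.
    # Values are kept in int8 for readability; a real exporter would pack
    # two 4-bit values into each byte.
    q = np.clip(np.rint(blocks / scales), -8, 7).astype(np.int8)
    return q, scales


def dequantize(q: np.ndarray, scales: np.ndarray, shape) -> np.ndarray:
    """Reconstruct approximate float32 weights from codes and block scales."""
    return (q.astype(np.float32) * scales).reshape(shape)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.standard_normal((64, 64)).astype(np.float32)
    q, scales = quantize_rtn_int4(w)
    w_hat = dequantize(q, scales, w.shape)
    print("max abs reconstruction error:", float(np.abs(w - w_hat).max()))
```

With this scheme the per-weight reconstruction error is bounded by half of the block's scale, which is why smaller blocks (more scales) trade extra storage for lower error.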

Problem

Research questions and friction points this paper is trying to address.

on-device ASR

streaming speech recognition

low-latency inference

resource-constrained hardware

CPU-only deployment

Innovation

Methods, ideas, or system contributions that make the work stand out.

streaming ASR

post-training quantization

ONNX Runtime