Sink or SWIM: Tackling Real-Time ASR at Scale

📅 2026-01-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of achieving low latency and high accuracy in real-time automatic speech recognition (ASR) under multi-user concurrent scenarios. The authors propose SWIM, a system built upon the OpenAI Whisper model that enables multilingual, multi-client streaming transcription without modifying the original model architecture. By leveraging model-level parallelism and a buffer-merging mechanism, SWIM significantly improves resource efficiency and throughput while preserving transcription quality. Experimental results demonstrate that the system scales to 20 concurrent streams across English, Italian, and Spanish, achieving an average latency of 2.4 seconds with five simultaneous clients while maintaining a stable word error rate of approximately 8.2%.

📝 Abstract
Real-time automatic speech recognition systems are increasingly integrated into interactive applications, from voice assistants to live transcription services. However, scaling these systems to support multiple concurrent clients while maintaining low latency and high accuracy remains a major challenge. In this work, we present SWIM, a novel real-time ASR system built on top of OpenAI's Whisper model that enables true model-level parallelization for scalable, multilingual transcription. SWIM supports multiple concurrent audio streams without modifying the underlying model. It introduces a buffer-merging strategy that maintains transcription fidelity while ensuring efficient resource usage. We evaluate SWIM in multi-client settings -- scaling up to 20 concurrent users -- and show that it delivers accurate real-time transcriptions in English, Italian, and Spanish, while maintaining low latency and high throughput. While Whisper-Streaming achieves a word error rate of approximately 8.2% with an average delay of approximately 3.4 s in a single-client, English-only setting, SWIM extends this capability to multilingual, multi-client environments. It maintains comparable accuracy with significantly lower delay -- around 2.4 s with 5 clients -- and continues to scale effectively up to 20 concurrent clients without degrading transcription quality, while increasing overall throughput. Our approach advances scalable ASR by improving robustness and efficiency in dynamic, multi-user environments.
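The buffer-merging strategy described above could be sketched roughly as follows. This is a hypothetical illustration, not the authors' implementation: the `StreamBuffer` class, the `min_chunk_s` threshold, and the `transcribe` stub are all assumptions standing in for SWIM's actual per-client buffering and batched Whisper inference.

```python
# Hypothetical sketch of per-client buffering with batch merging,
# so a single model instance can serve many concurrent streams.
from collections import defaultdict, deque

class StreamBuffer:
    """Accumulates audio chunks per client and merges them into batches."""

    def __init__(self, min_chunk_s=1.0):
        self.min_chunk_s = min_chunk_s           # assumed minimum audio per inference call
        self.buffers = defaultdict(deque)        # client_id -> pending (chunk, duration) pairs

    def push(self, client_id, chunk, duration_s):
        """Append one raw audio chunk for a client."""
        self.buffers[client_id].append((chunk, duration_s))

    def merge_ready(self):
        """Return (client_id, merged_audio) pairs for clients whose buffered
        duration has reached the threshold; their buffers are drained."""
        batch = []
        for cid, q in self.buffers.items():
            total = sum(d for _, d in q)
            if total >= self.min_chunk_s:
                merged = b"".join(c for c, _ in q)
                q.clear()
                batch.append((cid, merged))
        return batch

def transcribe(batch):
    # Placeholder for a single batched Whisper inference call over all
    # ready clients; a real system would decode audio and return text.
    return {cid: f"<transcript of {len(audio)} bytes>" for cid, audio in batch}

buf = StreamBuffer(min_chunk_s=1.0)
buf.push("client-1", b"\x00" * 16000, 0.5)
buf.push("client-1", b"\x00" * 16000, 0.6)   # client-1 now has 1.1 s buffered
buf.push("client-2", b"\x00" * 8000, 0.4)    # client-2 still below threshold
ready = buf.merge_ready()
print([cid for cid, _ in ready])             # → ['client-1']
```

The design choice being illustrated: clients below the threshold keep buffering, while clients with enough audio are merged into one batch, amortizing model invocation cost across streams without touching the model itself.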
Problem

Research questions and friction points this paper is trying to address.

real-time ASR
scalability
multilingual transcription
low latency
concurrent clients
Innovation

Methods, ideas, or system contributions that make the work stand out.

real-time ASR
model-level parallelization
buffer merging
multilingual transcription
scalable speech recognition
Federico Bruzzone
Department of Computer Science, Università degli Studi di Milano, Milan, Italy
Walter Cazzola
Full Professor, Università degli Studi di Milano
programming languages, programming techniques, programming language design and implementation, reflection and aspect-oriented programming, dynamic software evolution
Matteo Brancaleoni
Computer Science Division, VoiSmart, Milan, Italy
Dario Pellegrino
Computer Science Division, VoiSmart, Milan, Italy