Sink or SWIM: Tackling Real-Time ASR at Scale

📅 2026-01-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of achieving low latency and high accuracy in real-time automatic speech recognition (ASR) under multi-user concurrent scenarios. The authors propose SWIM, a system built upon the OpenAI Whisper model that enables multilingual, multi-client streaming transcription without modifying the original model architecture. By leveraging model-level parallelism and a buffer-merging mechanism, SWIM significantly improves resource efficiency and throughput while preserving transcription quality. Experimental results demonstrate that the system scales to 20 concurrent streams across English, Italian, and Spanish, achieving an average latency of 2.4 seconds with five simultaneous clients while maintaining a stable word error rate of approximately 8.2%.

📝 Abstract
Real-time automatic speech recognition systems are increasingly integrated into interactive applications, from voice assistants to live transcription services. However, scaling these systems to support multiple concurrent clients while maintaining low latency and high accuracy remains a major challenge. In this work, we present SWIM, a novel real-time ASR system built on top of OpenAI's Whisper model that enables true model-level parallelization for scalable, multilingual transcription. SWIM supports multiple concurrent audio streams without modifying the underlying model. It introduces a buffer-merging strategy that maintains transcription fidelity while ensuring efficient resource usage. We evaluate SWIM in multi-client settings -- scaling up to 20 concurrent users -- and show that it delivers accurate real-time transcriptions in English, Italian, and Spanish, while maintaining low latency and high throughput. While Whisper-Streaming achieves a word error rate of approximately 8.2% with an average delay of approximately 3.4 s in a single-client, English-only setting, SWIM extends this capability to multilingual, multi-client environments. It maintains comparable accuracy with significantly lower delay -- around 2.4 s with 5 clients -- and continues to scale effectively up to 20 concurrent clients without degrading transcription quality, while increasing overall throughput. Our approach advances scalable ASR by improving robustness and efficiency in dynamic, multi-user environments.
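The buffer-merging strategy described above could be sketched roughly as follows. This is a hypothetical illustration, not the authors' implementation: the `StreamBuffer` class, the `min_chunk_s` threshold, and the `transcribe` stub are all assumptions standing in for SWIM's actual per-client buffering and batched Whisper inference.

```python
# Hypothetical sketch of per-client buffering with batch merging,
# so a single model instance can serve many concurrent streams.
from collections import defaultdict, deque

class StreamBuffer:
    """Accumulates audio chunks per client and merges them into batches."""

    def __init__(self, min_chunk_s=1.0):
        self.min_chunk_s = min_chunk_s           # assumed minimum audio per inference call
        self.buffers = defaultdict(deque)        # client_id -> pending (chunk, duration) pairs

    def push(self, client_id, chunk, duration_s):
        """Append one raw audio chunk for a client."""
        self.buffers[client_id].append((chunk, duration_s))

    def merge_ready(self):
        """Return (client_id, merged_audio) pairs for clients whose buffered
        duration has reached the threshold; their buffers are drained."""
        batch = []
        for cid, q in self.buffers.items():
            total = sum(d for _, d in q)
            if total >= self.min_chunk_s:
                merged = b"".join(c for c, _ in q)
                q.clear()
                batch.append((cid, merged))
        return batch

def transcribe(batch):
    # Placeholder for a single batched Whisper inference call over all
    # ready clients; a real system would decode audio and return text.
    return {cid: f"<transcript of {len(audio)} bytes>" for cid, audio in batch}

buf = StreamBuffer(min_chunk_s=1.0)
buf.push("client-1", b"\x00" * 16000, 0.5)
buf.push("client-1", b"\x00" * 16000, 0.6)   # client-1 now has 1.1 s buffered
buf.push("client-2", b"\x00" * 8000, 0.4)    # client-2 still below threshold
ready = buf.merge_ready()
print([cid for cid, _ in ready])             # → ['client-1']
```

The design choice being illustrated: clients below the threshold keep buffering, while clients with enough audio are merged into one batch, amortizing model invocation cost across streams without touching the model itself.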
Problem

Research questions and friction points this paper is trying to address.

real-time ASR
scalability
multilingual transcription
low latency
concurrent clients
Innovation

Methods, ideas, or system contributions that make the work stand out.

real-time ASR
model-level parallelization
buffer merging
multilingual transcription
scalable speech recognition
Federico Bruzzone
Department of Computer Science, Università degli Studi di Milano, Milan, Italy
Walter Cazzola
Full Professor, Università degli Studi di Milano
programming languages, programming techniques, programming language design and implementation, reflection and aspect-oriented programming, dynamic software evolution
Matteo Brancaleoni
Computer Science Division, VoiSmart, Milan, Italy
Dario Pellegrino
Computer Science Division, VoiSmart, Milan, Italy