Voxtral Realtime

📅 2026-02-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work proposes a natively streaming end-to-end automatic speech recognition (ASR) model to overcome the performance degradation commonly observed in conventional streaming ASR systems that adapt offline models through chunking or sliding windows. Built on the Delayed Streams Modeling framework, the approach pairs a causal audio encoder with an Ada RMS-Norm mechanism for delay conditioning, explicitly aligning the audio and text streams. At a latency of 480 ms, the model achieves transcription accuracy comparable to Whisper. It is pretrained on a large-scale multilingual dataset covering 13 languages and is presented as the first end-to-end streaming ASR system to match the performance of state-of-the-art offline models. The code and model weights are publicly released under the Apache 2.0 license.
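The paper does not spell out Ada RMS-Norm here, but the name suggests RMS normalization whose per-channel gain is conditioned on the delay setting rather than being a fixed learned parameter. A minimal sketch, assuming (hypothetically) that the gain is predicted by a linear map from a delay embedding; the names `ada_rms_norm` and `weight` are illustrative, not from the paper:

```python
import math

def rms_norm(x, gain, eps=1e-6):
    # Plain RMS normalization: divide each element by the root-mean-square
    # of the vector, then apply a per-channel gain.
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [g * v / rms for g, v in zip(gain, x)]

def ada_rms_norm(x, delay_embedding, weight, eps=1e-6):
    # Adaptive variant (assumption): the gain is not a fixed parameter but
    # is computed from a conditioning vector -- here, a delay embedding --
    # via a linear map. `weight` has shape len(x) x len(delay_embedding).
    gain = [sum(w * d for w, d in zip(row, delay_embedding)) for row in weight]
    return rms_norm(x, gain, eps)
```

Conditioning the normalization on the delay lets a single model trade latency for accuracy at inference time, which is presumably why the abstract singles it out for "improved delay conditioning".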

📝 Abstract
We introduce Voxtral Realtime, a natively streaming automatic speech recognition model that matches offline transcription quality at sub-second latency. Unlike approaches that adapt offline models through chunking or sliding windows, Voxtral Realtime is trained end-to-end for streaming, with explicit alignment between audio and text streams. Our architecture builds on the Delayed Streams Modeling framework, introducing a new causal audio encoder and Ada RMS-Norm for improved delay conditioning. We scale pretraining to a large-scale dataset spanning 13 languages. At a delay of 480ms, Voxtral Realtime achieves performance on par with Whisper, the most widely deployed offline transcription system. We release the model weights under the Apache 2.0 license.
Problem

Research questions and friction points this paper is trying to address.

streaming automatic speech recognition
sub-second latency
real-time transcription
offline transcription quality
causal audio encoding
Innovation

Methods, ideas, or system contributions that make the work stand out.

streaming ASR
end-to-end training
causal audio encoder
delay conditioning
multilingual pretraining
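The "causal audio encoder" listed above means that each output frame may depend only on present and past audio, never future frames, which is what makes streaming possible without chunking. A minimal sketch of the underlying idea using a causal 1-D convolution (left-padding only); this is an illustration of causality, not the paper's actual encoder:

```python
def causal_conv1d(signal, kernel):
    # Causal 1-D convolution: the output at time t depends only on inputs
    # at times <= t. Causality is enforced by padding on the left with
    # zeros instead of padding symmetrically.
    k = len(kernel)
    padded = [0.0] * (k - 1) + list(signal)
    # out[t] covers signal[t-k+1 .. t], so no future sample is touched.
    return [sum(kernel[j] * padded[t + j] for j in range(k))
            for t in range(len(signal))]
```

For example, a kernel of `[1.0, 0.0]` simply delays the signal by one step, showing that the output at time t never reads past t.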