Voxtral Realtime

📅 2026-02-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work proposes a natively streaming end-to-end automatic speech recognition (ASR) model to overcome the performance degradation commonly observed in conventional streaming ASR systems that adapt offline models through chunking or sliding windows. Built on the Delayed Streams Modeling framework, the approach pairs a causal audio encoder with an Ada RMS-Norm mechanism for delay conditioning, explicitly aligning the audio and text streams. At a latency of 480 ms, the model achieves transcription accuracy comparable to Whisper. It is pretrained on a large-scale multilingual dataset covering 13 languages and is presented as the first end-to-end streaming ASR system to match the performance of state-of-the-art offline models. The code and model weights are publicly released under the Apache 2.0 license.
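The paper does not spell out Ada RMS-Norm here, but the name suggests RMS normalization whose per-channel gain is conditioned on the delay setting rather than being a fixed learned parameter. A minimal sketch, assuming (hypothetically) that the gain is predicted by a linear map from a delay embedding; the names `ada_rms_norm` and `weight` are illustrative, not from the paper:

```python
import math

def rms_norm(x, gain, eps=1e-6):
    # Plain RMS normalization: divide each element by the root-mean-square
    # of the vector, then apply a per-channel gain.
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [g * v / rms for g, v in zip(gain, x)]

def ada_rms_norm(x, delay_embedding, weight, eps=1e-6):
    # Adaptive variant (assumption): the gain is not a fixed parameter but
    # is computed from a conditioning vector -- here, a delay embedding --
    # via a linear map. `weight` has shape len(x) x len(delay_embedding).
    gain = [sum(w * d for w, d in zip(row, delay_embedding)) for row in weight]
    return rms_norm(x, gain, eps)
```

Conditioning the normalization on the delay lets a single model trade latency for accuracy at inference time, which is presumably why the abstract singles it out for "improved delay conditioning".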

📝 Abstract
We introduce Voxtral Realtime, a natively streaming automatic speech recognition model that matches offline transcription quality at sub-second latency. Unlike approaches that adapt offline models through chunking or sliding windows, Voxtral Realtime is trained end-to-end for streaming, with explicit alignment between audio and text streams. Our architecture builds on the Delayed Streams Modeling framework, introducing a new causal audio encoder and Ada RMS-Norm for improved delay conditioning. We scale pretraining to a large-scale dataset spanning 13 languages. At a delay of 480ms, Voxtral Realtime achieves performance on par with Whisper, the most widely deployed offline transcription system. We release the model weights under the Apache 2.0 license.
Problem

Research questions and friction points this paper is trying to address.

streaming automatic speech recognition
sub-second latency
real-time transcription
offline transcription quality
causal audio encoding
Innovation

Methods, ideas, or system contributions that make the work stand out.

streaming ASR
end-to-end training
causal audio encoder
delay conditioning
multilingual pretraining
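The "causal audio encoder" listed above means that each output frame may depend only on present and past audio, never future frames, which is what makes streaming possible without chunking. A minimal sketch of the underlying idea using a causal 1-D convolution (left-padding only); this is an illustration of causality, not the paper's actual encoder:

```python
def causal_conv1d(signal, kernel):
    # Causal 1-D convolution: the output at time t depends only on inputs
    # at times <= t. Causality is enforced by padding on the left with
    # zeros instead of padding symmetrically.
    k = len(kernel)
    padded = [0.0] * (k - 1) + list(signal)
    # out[t] covers signal[t-k+1 .. t], so no future sample is touched.
    return [sum(kernel[j] * padded[t + j] for j in range(k))
            for t in range(len(signal))]
```

For example, a kernel of `[1.0, 0.0]` simply delays the signal by one step, showing that the output at time t never reads past t.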