🤖 AI Summary
This work addresses the challenge of joint audio-text understanding by introducing Voxtral Mini and Voxtral Small, two multimodal audio chat models released as open weights. The models are trained to comprehend both spoken audio and text documents, and a 32K-token context window supports audio files up to 40 minutes long as well as extended multi-turn conversations. Key contributions include: (1) three new benchmarks for evaluating speech understanding models on knowledge and trivia; (2) state-of-the-art results across a diverse range of audio benchmarks, with Voxtral Small outperforming several closed-source models while preserving strong general text capabilities; and (3) release of both models under the Apache 2.0 license, with Voxtral Small compact enough to run locally. These results advance the speech comprehension capabilities of open multimodal language models.
📝 Abstract
We present Voxtral Mini and Voxtral Small, two multimodal audio chat models. Voxtral is trained to comprehend both spoken audio and text documents, achieving state-of-the-art performance across a diverse range of audio benchmarks, while preserving strong text capabilities. Voxtral Small outperforms a number of closed-source models, while being small enough to run locally. A 32K context window enables the model to handle audio files up to 40 minutes in duration and long multi-turn conversations. We also contribute three benchmarks for evaluating speech understanding models on knowledge and trivia. Both Voxtral models are released under the Apache 2.0 license.
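The stated limits (a 32K context window covering up to 40 minutes of audio) imply an upper bound on how many tokens each second of audio can consume. A minimal back-of-envelope sketch, assuming the window is 32 × 1024 tokens and that the entire window is spent on audio tokens (in practice some of it holds text, so the true audio token rate must be lower):

```python
# Back-of-envelope: implied audio token budget from the stated limits.
# Assumptions (not from the abstract): context = 32 * 1024 tokens,
# and the full window is available for audio tokens.
context_tokens = 32 * 1024
max_audio_seconds = 40 * 60  # 40 minutes

# Upper bound on audio tokens per second of input audio.
tokens_per_second = context_tokens / max_audio_seconds
print(round(tokens_per_second, 1))  # → 13.7
```

So the audio encoder must compress speech to at most roughly 13–14 tokens per second of audio for a 40-minute file to fit in the context window.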