🤖 AI Summary
This work addresses the challenge of joint audio-text understanding by introducing Voxtral Mini and Voxtral Small, two multimodal audio chat models released as open weights. The models are trained to comprehend both spoken audio and text documents, and a 32K-token context window supports audio files up to 40 minutes long as well as extended multi-turn conversations. Key contributions include: (1) three new benchmarks for evaluating speech understanding models on knowledge and trivia; (2) state-of-the-art results across a diverse range of audio benchmarks, with Voxtral Small outperforming several closed-source models while preserving strong general text capabilities; and (3) release of both models under the Apache 2.0 license, with Voxtral Small compact enough to run locally. These results advance the speech comprehension capabilities of open multimodal language models.
📝 Abstract
We present Voxtral Mini and Voxtral Small, two multimodal audio chat models. Voxtral is trained to comprehend both spoken audio and text documents, achieving state-of-the-art performance across a diverse range of audio benchmarks, while preserving strong text capabilities. Voxtral Small outperforms a number of closed-source models, while being small enough to run locally. A 32K context window enables the model to handle audio files up to 40 minutes in duration and long multi-turn conversations. We also contribute three benchmarks for evaluating speech understanding models on knowledge and trivia. Both Voxtral models are released under the Apache 2.0 license.
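The stated limits (a 32K context window covering up to 40 minutes of audio) imply an upper bound on how many tokens each second of audio can consume. A minimal back-of-envelope sketch, assuming the window is 32 × 1024 tokens and that the entire window is spent on audio tokens (in practice some of it holds text, so the true audio token rate must be lower):

```python
# Back-of-envelope: implied audio token budget from the stated limits.
# Assumptions (not from the abstract): context = 32 * 1024 tokens,
# and the full window is available for audio tokens.
context_tokens = 32 * 1024
max_audio_seconds = 40 * 60  # 40 minutes

# Upper bound on audio tokens per second of input audio.
tokens_per_second = context_tokens / max_audio_seconds
print(round(tokens_per_second, 1))  # → 13.7
```

So the audio encoder must compress speech to at most roughly 13–14 tokens per second of audio for a 40-minute file to fit in the context window.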