Voxtral TTS

📅 2026-03-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of achieving high-quality, expressive multilingual text-to-speech synthesis and voice cloning from extremely short reference audio clips (as little as 3 seconds) by proposing a hybrid generative architecture. In this framework, semantic speech tokens are modeled autoregressively, while acoustic tokens are generated via flow matching. A newly designed Voxtral Codec, a speech tokenizer trained from scratch with a hybrid VQ-FSQ quantization scheme, is introduced to optimize speech representations. Evaluated within an end-to-end multilingual TTS system, the approach significantly enhances the naturalness and expressiveness of few-shot synthesized speech. Human evaluations by native speakers show that its voice cloning quality surpasses that of ElevenLabs Flash v2.5, with a 68.4% win rate. The model weights have been publicly released.
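The summary above describes a two-stage generation pipeline: an autoregressive model produces semantic speech tokens, a flow-matching model maps them to acoustic codec tokens, and a codec decoder reconstructs the waveform. The following is a minimal structural sketch of that data flow; all function names, token vocabulary sizes, and token counts are illustrative assumptions, not the released Voxtral TTS API.

```python
def autoregressive_semantic_tokens(text, reference_audio, n_tokens=5):
    """Stage 1 (assumed): an AR model emits semantic speech tokens one
    at a time, conditioned on the text and a ~3 s reference clip.
    The hash-based sampling below is a stand-in for the real model."""
    tokens = []
    for step in range(n_tokens):
        # Dummy stand-in for sampling from a next-token distribution
        # over an assumed 1024-entry semantic vocabulary.
        tokens.append(hash((text, len(reference_audio), step)) % 1024)
    return tokens

def flow_matching_acoustic_tokens(semantic_tokens):
    """Stage 2 (assumed): a flow-matching model maps semantic tokens to
    acoustic codec tokens non-autoregressively (here, a toy mapping
    into an assumed 4096-entry acoustic vocabulary)."""
    return [(t * 7 + 3) % 4096 for t in semantic_tokens]

def codec_decode(acoustic_tokens):
    """Stage 3 (assumed): the codec decoder turns acoustic tokens back
    into a waveform (here, a dummy list of floats in [0, 1))."""
    return [t / 4096.0 for t in acoustic_tokens]

# Wiring the stages together on dummy inputs (48 kHz * 1 s of silence).
semantic = autoregressive_semantic_tokens("Hello world", [0.0] * 48000)
acoustic = flow_matching_acoustic_tokens(semantic)
waveform = codec_decode(acoustic)
```

The key design point the paper highlights is the split itself: the autoregressive stage captures long-range linguistic structure, while the flow-matching stage fills in fine acoustic detail in parallel.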

📝 Abstract
We introduce Voxtral TTS, an expressive multilingual text-to-speech model that generates natural speech from as little as 3 seconds of reference audio. Voxtral TTS adopts a hybrid architecture that combines auto-regressive generation of semantic speech tokens with flow-matching for acoustic tokens. These tokens are encoded and decoded with Voxtral Codec, a speech tokenizer trained from scratch with a hybrid VQ-FSQ quantization scheme. In human evaluations conducted by native speakers, Voxtral TTS is preferred for multilingual voice cloning due to its naturalness and expressivity, achieving a 68.4% win rate over ElevenLabs Flash v2.5. We release the model weights under a CC BY-NC license.
Problem

Research questions and friction points this paper is trying to address.

text-to-speech
voice cloning
multilingual
expressive speech
few-shot
Innovation

Methods, ideas, or system contributions that make the work stand out.

Voxtral TTS
hybrid architecture
flow-matching
VQ-FSQ quantization
multilingual voice cloning