LLMVoX: Autoregressive Streaming Text-to-Speech Model for Any LLM

📅 2025-03-06
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Current speech-to-speech dialogue systems suffer from LLM dependency, high computational overhead, text–speech misalignment, and degradation of LLM capabilities. This paper proposes LLMVoX—a lightweight (30M-parameter), LLM-agnostic streaming TTS framework featuring a novel multi-queue token-stream architecture that fully decouples speech synthesis from LLM inference, requiring no LLM architectural modifications or multimodal fine-tuning. It enables real-time, low-latency, high-fidelity speech synthesis with arbitrary LLMs, supporting unlimited-length dialogues. LLMVoX offers zero-shot cross-modal extensibility—unifying speech, text, and vision—and zero-shot cross-lingual transfer without adaptation. Experiments show LLMVoX achieves a significantly lower Word Error Rate (WER) than state-of-the-art speech-enabled LLMs, comparable UTMOS scores, and a lower Arabic CER. It has been successfully integrated with a vision-language model to construct a training-free trimodal system.

📝 Abstract
Recent advancements in speech-to-speech dialogue systems leverage LLMs for multimodal interactions, yet they remain hindered by fine-tuning requirements, high computational overhead, and text–speech misalignment. Existing speech-enabled LLMs often degrade conversational quality by modifying the LLM, thereby compromising its linguistic capabilities. In contrast, we propose LLMVoX, a lightweight 30M-parameter, LLM-agnostic, autoregressive streaming TTS system that generates high-quality speech with low latency, while fully preserving the capabilities of the base LLM. Our approach achieves a significantly lower Word Error Rate compared to speech-enabled LLMs, while operating at comparable latency and UTMOS score. By decoupling speech synthesis from LLM processing via a multi-queue token streaming system, LLMVoX supports seamless, infinite-length dialogues. Its plug-and-play design also facilitates extension to various tasks with different backbones. Furthermore, LLMVoX generalizes to new languages with only dataset adaptation, attaining a low Character Error Rate on an Arabic speech task. Additionally, we have integrated LLMVoX with a Vision-Language Model to create an omni-model with speech, text, and vision capabilities, without requiring additional multimodal training. Our code base and project page are available at https://mbzuai-oryx.github.io/LLMVoX .
Problem

Research questions and friction points this paper is trying to address.

Fine-tuning requirements and high computational overhead burden existing speech-to-speech systems.
Modifying the LLM for speech degrades its linguistic capabilities.
Supporting seamless, infinite-length dialogues at low latency remains difficult.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Lightweight autoregressive streaming TTS system
Decouples speech synthesis from LLM processing
Supports seamless infinite-length dialogues
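The decoupling above rests on a multi-queue token-streaming design: the LLM pushes text tokens into one queue while a separate TTS worker drains it and emits audio chunks into another, so neither side blocks the other. A minimal sketch of that idea follows; the `llm_tokens` input and `synthesize_chunk` placeholder are hypothetical stand-ins, not the paper's actual components.

```python
import queue
import threading

def run_pipeline(llm_tokens):
    """Sketch of the multi-queue decoupling: LLM producer -> text queue
    -> TTS worker -> audio queue -> consumer. Each stage runs in its own
    thread, so speech synthesis never stalls LLM token generation."""
    text_q = queue.Queue()   # LLM thread -> TTS thread
    audio_q = queue.Queue()  # TTS thread -> playback/consumer
    SENTINEL = None          # end-of-stream marker

    def synthesize_chunk(token):
        # Placeholder "synthesis": tag the token as an audio chunk.
        return f"audio({token})"

    def llm_worker():
        # The LLM streams text tokens without waiting on speech synthesis.
        for tok in llm_tokens:
            text_q.put(tok)
        text_q.put(SENTINEL)

    def tts_worker():
        # The TTS side consumes tokens as they arrive and emits audio chunks.
        while True:
            tok = text_q.get()
            if tok is SENTINEL:
                audio_q.put(SENTINEL)
                break
            audio_q.put(synthesize_chunk(tok))

    threads = [threading.Thread(target=llm_worker),
               threading.Thread(target=tts_worker)]
    for t in threads:
        t.start()

    # Consumer: drain audio chunks as they become available (streaming).
    chunks = []
    while True:
        chunk = audio_q.get()
        if chunk is SENTINEL:
            break
        chunks.append(chunk)
    for t in threads:
        t.join()
    return chunks

print(run_pipeline(["Hello", "world"]))
# -> ['audio(Hello)', 'audio(world)']
```

Because each queue is a FIFO with a single producer and single consumer, chunk order matches token order, which is what allows unbounded dialogue length without re-synchronizing the two models.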
Sambal Shikhar
Mohamed Bin Zayed University of Artificial Intelligence (MBZUAI), UAE
Mohammed Irfan Kurpath
Mohamed Bin Zayed University of Artificial Intelligence (MBZUAI), UAE
Sahal Shaji Mullappilly
PhD Computer Vision Student, MBZUAI
Vision Language Models, Computer Vision, Object Detection, Real-time models
Jean Lahoud
Mohamed bin Zayed University of Artificial Intelligence (MBZUAI)
Computer Vision
Fahad Khan
Mohamed Bin Zayed University of Artificial Intelligence (MBZUAI), UAE; Linköping University, Sweden
R. Anwer
Mohamed Bin Zayed University of Artificial Intelligence (MBZUAI), UAE
Salman Khan
Mohamed Bin Zayed University of Artificial Intelligence (MBZUAI), UAE
Hisham Cholakkal
Mohamed bin Zayed University of Artificial Intelligence (MBZUAI)
Computer Vision, Large Multimodal Models, LLM, Healthcare Foundation Model, Conversational Assistant