🤖 AI Summary
Current speech-to-speech dialogue systems suffer from LLM fine-tuning requirements, high computational overhead, text–speech misalignment, and degradation of the base LLM's capabilities. This paper proposes LLMVoX—a lightweight (30M-parameter), LLM-agnostic streaming TTS framework featuring a multi-queue token-stream architecture that fully decouples speech synthesis from LLM inference, requiring no LLM architectural modifications or multimodal fine-tuning. It enables real-time, low-latency, high-fidelity speech synthesis with arbitrary LLMs and supports unlimited-length dialogues. LLMVoX also offers plug-and-play cross-modal extensibility—unifying speech, text, and vision—and cross-lingual transfer with only dataset adaptation. Experiments show LLMVoX achieves a significantly lower Word Error Rate (WER) than state-of-the-art speech-enabled LLMs at comparable latency and UTMOS scores, and a low Arabic Character Error Rate (CER). It has also been integrated with a vision-language model to construct a trimodal system without additional multimodal training.
📝 Abstract
Recent advancements in speech-to-speech dialogue systems leverage LLMs for multimodal interactions, yet they remain hindered by fine-tuning requirements, high computational overhead, and text–speech misalignment. Existing speech-enabled LLMs often degrade conversational quality by modifying the LLM, thereby compromising its linguistic capabilities. In contrast, we propose LLMVoX, a lightweight 30M-parameter, LLM-agnostic, autoregressive streaming TTS system that generates high-quality speech with low latency, while fully preserving the capabilities of the base LLM. Our approach achieves a significantly lower Word Error Rate compared to speech-enabled LLMs, while operating at comparable latency and UTMOS score. By decoupling speech synthesis from LLM processing via a multi-queue token streaming system, LLMVoX supports seamless, infinite-length dialogues. Its plug-and-play design also facilitates extension to various tasks with different backbones. Furthermore, LLMVoX generalizes to new languages with only dataset adaptation, attaining a low Character Error Rate on an Arabic speech task. Additionally, we have integrated LLMVoX with a Vision-Language Model to create an omni-model with speech, text, and vision capabilities, without requiring additional multimodal training. Our code base and project page are available at https://mbzuai-oryx.github.io/LLMVoX.
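The core decoupling idea—streaming LLM text tokens into a queue that a separate TTS worker drains—can be sketched as a producer–consumer pipeline. This is an illustrative sketch only, not the paper's implementation: the `fake_llm_stream` and `fake_tts_consumer` functions are hypothetical stand-ins for the LLM and the streaming TTS module.

```python
import queue
import threading

SENTINEL = None  # marks the end of the LLM's token stream

def fake_llm_stream(text_queue):
    """Stand-in for streaming LLM inference: emits text tokens one by one.

    Because tokens go into a queue, LLM decoding is never blocked by
    downstream speech synthesis -- the decoupling LLMVoX relies on.
    """
    for token in ["Hello", ",", " world", "!"]:
        text_queue.put(token)
    text_queue.put(SENTINEL)

def fake_tts_consumer(text_queue, audio_chunks):
    """Stand-in for the streaming TTS worker: consumes tokens as they arrive."""
    while True:
        token = text_queue.get()
        if token is SENTINEL:
            break
        # A real system would synthesize and play audio here; we record a tag.
        audio_chunks.append(f"<audio:{token.strip()}>")

text_queue = queue.Queue()
audio_chunks = []
producer = threading.Thread(target=fake_llm_stream, args=(text_queue,))
consumer = threading.Thread(target=fake_tts_consumer, args=(text_queue, audio_chunks))
producer.start()
consumer.start()
producer.join()
consumer.join()
print(audio_chunks)  # → ['<audio:Hello>', '<audio:,>', '<audio:world>', '<audio:!>']
```

Because producer and consumer run in separate threads coupled only by the queue, either side can be swapped out (a different LLM backbone, a different synthesizer) without touching the other—mirroring the plug-and-play, LLM-agnostic design described above.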