🤖 AI Summary
Current speech-to-speech dialogue systems suffer from LLM fine-tuning requirements, high computational overhead, text–speech misalignment, and degradation of the base LLM's capabilities. This paper proposes LLMVoX—a lightweight (30M-parameter), LLM-agnostic streaming TTS framework featuring a multi-queue token-stream architecture that fully decouples speech synthesis from LLM inference, requiring no LLM architectural modifications or multimodal fine-tuning. It enables real-time, low-latency, high-fidelity speech synthesis with arbitrary LLMs and supports unlimited-length dialogues. LLMVoX also offers plug-and-play cross-modal extensibility—unifying speech, text, and vision—and cross-lingual transfer with only dataset adaptation. Experiments show LLMVoX achieves a significantly lower Word Error Rate (WER) than state-of-the-art speech-enabled LLMs at comparable latency and UTMOS scores, and a low Arabic Character Error Rate (CER). It has also been integrated with a vision-language model to construct a trimodal system without additional multimodal training.
📝 Abstract
Recent advancements in speech-to-speech dialogue systems leverage LLMs for multimodal interactions, yet they remain hindered by fine-tuning requirements, high computational overhead, and text–speech misalignment. Existing speech-enabled LLMs often degrade conversational quality by modifying the LLM, thereby compromising its linguistic capabilities. In contrast, we propose LLMVoX, a lightweight 30M-parameter, LLM-agnostic, autoregressive streaming TTS system that generates high-quality speech with low latency, while fully preserving the capabilities of the base LLM. Our approach achieves a significantly lower Word Error Rate compared to speech-enabled LLMs, while operating at comparable latency and UTMOS score. By decoupling speech synthesis from LLM processing via a multi-queue token streaming system, LLMVoX supports seamless, infinite-length dialogues. Its plug-and-play design also facilitates extension to various tasks with different backbones. Furthermore, LLMVoX generalizes to new languages with only dataset adaptation, attaining a low Character Error Rate on an Arabic speech task. Additionally, we have integrated LLMVoX with a Vision-Language Model to create an omni-model with speech, text, and vision capabilities, without requiring additional multimodal training. Our code base and project page are available at https://mbzuai-oryx.github.io/LLMVoX.
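The core decoupling idea—streaming LLM text tokens into a queue that a separate TTS worker drains—can be sketched as a producer–consumer pipeline. This is an illustrative sketch only, not the paper's implementation: the `fake_llm_stream` and `fake_tts_consumer` functions are hypothetical stand-ins for the LLM and the streaming TTS module.

```python
import queue
import threading

SENTINEL = None  # marks the end of the LLM's token stream

def fake_llm_stream(text_queue):
    """Stand-in for streaming LLM inference: emits text tokens one by one.

    Because tokens go into a queue, LLM decoding is never blocked by
    downstream speech synthesis -- the decoupling LLMVoX relies on.
    """
    for token in ["Hello", ",", " world", "!"]:
        text_queue.put(token)
    text_queue.put(SENTINEL)

def fake_tts_consumer(text_queue, audio_chunks):
    """Stand-in for the streaming TTS worker: consumes tokens as they arrive."""
    while True:
        token = text_queue.get()
        if token is SENTINEL:
            break
        # A real system would synthesize and play audio here; we record a tag.
        audio_chunks.append(f"<audio:{token.strip()}>")

text_queue = queue.Queue()
audio_chunks = []
producer = threading.Thread(target=fake_llm_stream, args=(text_queue,))
consumer = threading.Thread(target=fake_tts_consumer, args=(text_queue, audio_chunks))
producer.start()
consumer.start()
producer.join()
consumer.join()
print(audio_chunks)  # → ['<audio:Hello>', '<audio:,>', '<audio:world>', '<audio:!>']
```

Because producer and consumer run in separate threads coupled only by the queue, either side can be swapped out (a different LLM backbone, a different synthesizer) without touching the other—mirroring the plug-and-play, LLM-agnostic design described above.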