Llasa: Scaling Train-Time and Inference-Time Compute for Llama-based Speech Synthesis

📅 2025-02-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
Problem: Existing LLM-based text-to-speech (TTS) systems predominantly adopt multi-stage architectures (e.g., an LLM followed by a diffusion model), which complicates decisions about where to scale compute during training and inference. Method: Llasa is an end-to-end TTS framework designed for unified train-time and inference-time compute scaling. It pairs a single-layer vector-quantized (VQ) codec with a single Transformer architecture fully aligned with standard Llama models, enabling one training recipe across 1B/3B/8B scales; a speech understanding model additionally serves as an inference-time verifier for verifier-guided autoregressive sampling. Contribution/Results: Llasa offers a single-model, natively Llama-compatible TTS design that scales across model sizes. Scaling train-time compute improves naturalness and prosodic accuracy, while scaling inference-time compute improves emotional expressiveness, timbre consistency, and content fidelity. All models and code are open-sourced.
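As a rough illustration of the single-Transformer design described above, the sketch below treats speech codec tokens as an extension of a Llama-style text vocabulary and samples them autoregressively. The vocabulary sizes, the `SPEECH_OFFSET` layout, and the Hugging Face-style `model(...).logits` interface are assumptions for illustration, not the paper's exact implementation.

```python
import torch

# Assumed sizes: a Llama-style text vocabulary extended with one codebook
# from the single-layer VQ codec. The exact token layout is illustrative.
TEXT_VOCAB = 32_000
CODEC_VOCAB = 65_536
SPEECH_OFFSET = TEXT_VOCAB  # speech tokens live after the text tokens

def build_sequence(text_ids: torch.Tensor, speech_ids: torch.Tensor) -> torch.Tensor:
    """One training sequence: text tokens followed by offset speech tokens."""
    return torch.cat([text_ids, speech_ids + SPEECH_OFFSET])

@torch.no_grad()
def synthesize(model, text_ids: torch.Tensor, max_new: int = 2048,
               temperature: float = 0.8) -> torch.Tensor:
    """Sample speech tokens conditioned on text with one causal LM."""
    seq = text_ids.clone()
    for _ in range(max_new):
        logits = model(seq.unsqueeze(0)).logits[0, -1] / temperature
        logits[:SPEECH_OFFSET] = float("-inf")  # only speech tokens are valid here
        next_id = torch.multinomial(torch.softmax(logits, dim=-1), 1)
        seq = torch.cat([seq, next_id])
        # (an end-of-speech stopping check is omitted for brevity)
    return seq[len(text_ids):] - SPEECH_OFFSET  # codec ids for waveform decoding
```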

📝 Abstract
Recent advances in text-based large language models (LLMs), particularly the GPT series and the o1 model, have demonstrated the effectiveness of scaling both training-time and inference-time compute. However, current state-of-the-art TTS systems that leverage LLMs are often multi-stage, requiring separate models (e.g., a diffusion model after the LLM), which complicates the decision of whether to scale a particular model during training or testing. This work makes the following contributions: First, we explore the scaling of train-time and inference-time compute for speech synthesis. Second, we propose Llasa, a simple framework for speech synthesis that employs a single-layer vector quantizer (VQ) codec and a single Transformer architecture to fully align with standard LLMs such as Llama. Our experiments reveal that scaling train-time compute for Llasa consistently improves the naturalness of synthesized speech and enables the generation of more complex and accurate prosody patterns. Furthermore, from the perspective of scaling inference-time compute, we employ speech understanding models as verifiers during the search, finding that scaling inference-time compute shifts the sampling modes toward the preferences of specific verifiers, thereby improving emotional expressiveness, timbre consistency, and content accuracy. In addition, we publicly release the checkpoints and training code for our TTS models (1B, 3B, 8B) and codec model.
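The abstract's inference-time scaling can be pictured as a simple verifier-guided search. Below is a minimal best-of-N sketch; `tts_sample` and `verifier_score` are hypothetical stand-ins for the TTS model and the speech understanding verifier, and the paper's actual search procedure may differ.

```python
def best_of_n(text, prompt_audio, tts_sample, verifier_score, n=16):
    """Draw n candidate utterances and keep the one the verifier prefers.

    tts_sample and verifier_score are hypothetical stand-ins: the verifier
    might score transcript accuracy (a speech recognizer), emotion, or
    timbre similarity, so a larger n shifts outputs toward its preference.
    """
    candidates = [tts_sample(text, prompt_audio) for _ in range(n)]
    scores = [verifier_score(text, prompt_audio, c) for c in candidates]
    best = max(range(n), key=scores.__getitem__)
    return candidates[best], scores[best]
```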
Problem

Research questions and friction points this paper is trying to address.

Scaling train-time and inference-time compute
Simplified framework for speech synthesis
Improving naturalness and prosody patterns
Innovation

Methods, ideas, or system contributions that make the work stand out.

Single-layer VQ codec (see the sketch after this list)
Transformer architecture alignment
Scaling compute for speech synthesis
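For the single-layer VQ codec listed above, the core operation is nearest-neighbor lookup in one codebook, so each audio frame becomes exactly one discrete token. A minimal sketch follows; the codebook size and feature dimension are assumed for illustration, not taken from the paper.

```python
import torch

def vq_encode(frames: torch.Tensor, codebook: torch.Tensor) -> torch.Tensor:
    """frames: (T, D) encoder features; codebook: (K, D). Returns (T,) token ids."""
    dists = torch.cdist(frames, codebook)  # (T, K) Euclidean distances
    return dists.argmin(dim=-1)            # index of the nearest codebook entry

def vq_decode(ids: torch.Tensor, codebook: torch.Tensor) -> torch.Tensor:
    """Map token ids back to their quantized embeddings for the decoder."""
    return codebook[ids]

# Illustrative sizes: a single 65,536-entry codebook of 512-d vectors.
codebook = torch.randn(65_536, 512)
frames = torch.randn(100, 512)        # 100 frames of encoder output
ids = vq_encode(frames, codebook)     # one token per frame, no residual layers
recon = vq_decode(ids, codebook)      # quantized features for reconstruction
```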
👥 Authors
Zhen Ye
The Hong Kong University of Science and Technology
Xinfa Zhu
Northwestern Polytechnical University
speech generation
Chi-Min Chan
HKUST
Large Language Models · Post-Training · Alignment · LLM Agents
Xinsheng Wang
Hong Kong University of Science and Technology (HKUST)
speech synthesis · singing voice synthesis · voice conversion
Xu Tan
Independent Researcher
Jiahe Lei
University of Science and Technology Beijing
Yi Peng
Bytedance
Machine Learning · Image Processing · Visualization
Haohe Liu
Research Scientist at Meta AI
Audio Generation · Audio Classification · Speech Quality Enhancement · Music Source Separation
Yizhu Jin
The Hong Kong University of Science and Technology
Zheqi DAI
Chinese University of Hong Kong
Hongzhan Lin
Hong Kong Baptist University
Natural Language Processing · Multimodal Reasoning · Social Computing
Jianyi Chen
The Hong Kong University of Science and Technology
Xingjian Du
University of Rochester
Liumeng Xue
Hong Kong University of Science and Technology
Audio, Speech and Language Processing · Speech Generation
Yunlin Chen
Mobvoi
speech · avatar
Zhifei Li
Research Scientist at Google
machine translation · natural language processing · machine learning · wireless networks
Lei Xie
ASLP Lab, Northwestern Polytechnical University
Qiuqiang Kong
The Chinese University of Hong Kong
Audio Processing · Artificial Intelligence
Yike Guo
The Hong Kong University of Science and Technology
Wei Xue
The Hong Kong University of Science and Technology