Koel-TTS: Enhancing LLM based Speech Generation with Preference Alignment and Classifier Free Guidance

📅 2025-02-07
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Autoregressive speech token generation models suffer from hallucination, pronunciation inaccuracy, and speaker deviation, limiting controllability and naturalness. This paper proposes an end-to-end speech synthesis framework that directly maps text and reference speech to acoustic tokens. The method uses an encoder-decoder Transformer jointly optimized via preference-based learning with speech-level feedback. Key contributions include: (1) the first multi-objective preference alignment mechanism integrating ASR transcription accuracy and speaker verification similarity; and (2) classifier-free guidance enabling fine-grained, controllable generation from few-shot reference audio. Experiments demonstrate state-of-the-art performance on speaker similarity, intelligibility, and naturalness, achieving superior results despite a significantly smaller training dataset than prior approaches.
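The multi-objective preference signal described above could be sketched as a simple combined reward over sampled candidates, where generations are ranked by ASR character error rate and speaker-verification similarity. The weights and scoring function below are illustrative assumptions, not the paper's exact formulation:

```python
def preference_score(cer: float, spk_sim: float,
                     w_cer: float = 1.0, w_sim: float = 1.0) -> float:
    """Combine ASR character error rate (lower is better) with
    speaker-verification cosine similarity (higher is better).
    Weights are hypothetical, for illustration only."""
    return w_sim * spk_sim - w_cer * cer

def rank_candidates(candidates):
    """Order sampled generations from most to least preferred.
    `candidates` is a list of (cer, spk_sim) tuples, one per sample."""
    return sorted(range(len(candidates)),
                  key=lambda i: preference_score(*candidates[i]),
                  reverse=True)
```

Chosen/rejected pairs for preference optimization would then be formed from the top- and bottom-ranked samples.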

📝 Abstract
While autoregressive speech token generation models produce speech with remarkable variety and naturalness, their inherent lack of controllability often results in issues such as hallucinations and undesired vocalizations that do not conform to conditioning inputs. We introduce Koel-TTS, a suite of enhanced encoder-decoder Transformer TTS models that address these challenges by incorporating preference alignment techniques guided by automatic speech recognition and speaker verification models. Additionally, we incorporate classifier-free guidance to further improve synthesis adherence to the transcript and reference speaker audio. Our experiments demonstrate that these optimizations significantly enhance target speaker similarity, intelligibility, and naturalness of synthesized speech. Notably, Koel-TTS directly maps text and context audio to acoustic tokens, and on the aforementioned metrics, outperforms state-of-the-art TTS models, despite being trained on a significantly smaller dataset. Audio samples and demos are available on our website.
Problem

Research questions and friction points this paper is trying to address.

Enhances controllability of speech generation
Reduces hallucinations and undesired vocalizations
Improves adherence to transcript and speaker audio
Innovation

Methods, ideas, or system contributions that make the work stand out.

Preference alignment techniques
Classifier-free guidance
Enhanced encoder-decoder Transformer
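Classifier-free guidance, listed above, combines conditional and unconditional model predictions at inference time to push generation toward the conditioning inputs. A minimal sketch of the standard logit interpolation follows; the guidance scale and the list-based representation are assumptions, not details from the paper:

```python
def cfg_logits(cond_logits, uncond_logits, scale=2.0):
    """Classifier-free guidance over next-token logits.

    Moves the prediction away from the unconditional distribution
    and past the conditional one: u + scale * (c - u).
    scale=1.0 recovers the purely conditional logits;
    scale > 1.0 strengthens adherence to text and reference audio.
    The scale value here is a hypothetical example."""
    return [u + scale * (c - u)
            for c, u in zip(cond_logits, uncond_logits)]
```

During training this typically requires randomly dropping the conditioning inputs so the model also learns an unconditional distribution.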