🤖 AI Summary
This work addresses content hallucination and modality misalignment in existing text-to-speech (TTS) systems, problems that stem from fixed-frame-rate acoustic tokens producing speech sequences significantly longer than the input text. To resolve this, the authors propose a novel tokenization scheme that enforces strict temporal alignment between text and acoustic features, achieving one-to-one synchronization for the first time. Building on flow matching in a latent space, the approach unifies speech and text within a single-stream large language model (LLM) framework. A text-guided hybrid logits generation mechanism enables seamless switching and coherent integration between pure-text and speech modalities. Experiments show that the method nearly eliminates content hallucination while preserving linguistic integrity, achieves performance on par with state-of-the-art TTS and spoken language models (SLMs), and substantially reduces inference cost.
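The summary's mention of "flow matching in a latent space" refers to a standard generative-modeling objective: a network regresses a velocity field along a path between noise and data. The paper does not give its formulation, so the sketch below uses the common linear-interpolation (conditional flow matching) variant, with all names and shapes hypothetical:

```python
import numpy as np

def flow_matching_loss(v_theta, x0, x1, t):
    """Conditional flow matching with a linear path.

    x0: noise sample, x1: latent acoustic feature (hypothetical stand-in
    for the paper's continuous speech latents). The model v_theta is
    trained to predict the constant target velocity (x1 - x0) at the
    interpolated point x_t = (1 - t) * x0 + t * x1.
    """
    x_t = (1.0 - t) * x0 + t * x1   # point on the straight-line path
    target = x1 - x0                # target velocity along that path
    pred = v_theta(x_t, t)
    return float(np.mean((pred - target) ** 2))

# Toy check: an oracle "model" that returns the exact velocity field
# for this (x0, x1) pair drives the loss to zero.
rng = np.random.default_rng(0)
x0 = rng.normal(size=(4, 8))
x1 = rng.normal(size=(4, 8))
loss = flow_matching_loss(lambda x, t: x1 - x0, x0, x1, t=0.3)
```

In the paper's setting, `v_theta` would be the flow matching head on top of the LLM's hidden states, conditioned on the synchronized text tokens.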
📝 Abstract
Modern Text-to-Speech (TTS) systems increasingly leverage Large Language Model (LLM) architectures to achieve scalable, high-fidelity, zero-shot generation. However, these systems typically rely on fixed-frame-rate acoustic tokenization, resulting in speech sequences that are significantly longer than, and asynchronous with, their corresponding text. Beyond computational inefficiency, this sequence length disparity often triggers hallucinations in TTS and amplifies the modality gap in spoken language modeling (SLM). In this paper, we propose a novel tokenization scheme that establishes one-to-one synchronization between continuous acoustic features and text tokens, enabling unified, single-stream modeling within an LLM. We demonstrate that these synchronous tokens maintain high-fidelity audio reconstruction and can be effectively modeled in a latent space by a large language model with a flow matching head. Moreover, the ability to seamlessly toggle the speech modality within the context enables text-only guidance: a technique that blends logits from text-only and text-speech modes to flexibly bridge the gap toward text-only LLM intelligence. Experimental results indicate that our approach achieves performance competitive with state-of-the-art TTS and SLM systems while virtually eliminating content hallucinations and preserving linguistic integrity, all at a significantly reduced inference cost.
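The abstract describes text-only guidance as blending logits from the text-only and text-speech decoding modes. The exact blending rule is not given here, so the following is a minimal sketch assuming a simple convex combination with a hypothetical guidance weight `alpha`:

```python
import numpy as np

def text_only_guidance(logits_text_only, logits_text_speech, alpha):
    """Hypothetical text-only guidance: mix the logits produced when the
    model decodes in text-only mode with those from the text-speech mode.
    alpha = 1.0 recovers pure text-only behavior; alpha = 0.0 recovers
    the text-speech mode unchanged."""
    return alpha * logits_text_only + (1.0 - alpha) * logits_text_speech

def sample_greedy(logits):
    """Greedy next-token choice from the blended logits."""
    return int(np.argmax(logits))

# Toy vocabulary of 5 tokens: the two modes disagree on the top token,
# and the blend arbitrates between them.
lt = np.array([0.1, 2.0, 0.0, 0.0, 0.0])   # text-only mode prefers token 1
ls = np.array([0.1, 0.0, 1.5, 0.0, 0.0])   # text-speech mode prefers token 2
token = sample_greedy(text_only_guidance(lt, ls, alpha=0.8))
```

This mirrors classifier-free-guidance-style logit interpolation; the paper's actual mechanism may weight or gate the two modes differently.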