🤖 AI Summary
To address poor semantic coherence and inadequate nonverbal acoustic modeling in spontaneous spoken dialogue generation, this paper proposes an end-to-end framework synergizing large language models (LLMs) and speech-language models (SLMs). First, an LLM generates semantically consistent textual dialogues; then, phoneme sequence modeling and a novel two-tower Transformer-based duration predictor—introducing semantic-guided joint phoneme duration modeling—are employed; finally, high-fidelity speech synthesis is achieved. This approach establishes a text-to-speech joint optimization framework. Evaluated on the Fisher corpus, it significantly improves naturalness (MOS 4.12) and semantic fidelity (BLEU-4 increase of 23.6%) while reducing disfluency by 37%. To our knowledge, this is the first work enabling semantic-controllable and acoustically natural spoken dialogue generation driven entirely by LLMs.
📝 Abstract
Recently, ``textless"speech language models (SLMs) based on speech units have made huge progress in generating naturalistic speech, including non-verbal vocalizations. However, the generated speech samples often lack semantic coherence. In this paper, we propose SLM and LLM Integration for spontaneous spoken Dialogue gEneration (SLIDE). Specifically, we first utilize an LLM to generate the textual content of spoken dialogue. Next, we convert the textual dialogues into phoneme sequences and use a two-tower transformer-based duration predictor to predict the duration of each phoneme. Finally, an SLM conditioned on the spoken phoneme sequences is used to vocalize the textual dialogue. Experimental results on the Fisher dataset demonstrate that our system can generate naturalistic spoken dialogue while maintaining high semantic coherence.