SLIDE: Integrating Speech Language Model with LLM for Spontaneous Spoken Dialogue Generation

📅 2025-01-01

📈 Citations: 0

✨ Influential: 0

career value

193K/year

🤖 AI Summary

To address poor semantic coherence and inadequate nonverbal acoustic modeling in spontaneous spoken dialogue generation, this paper proposes an end-to-end framework synergizing large language models (LLMs) and speech-language models (SLMs). First, an LLM generates semantically consistent textual dialogues; then, phoneme sequence modeling and a novel two-tower Transformer-based duration predictor—introducing semantic-guided joint phoneme duration modeling—are employed; finally, high-fidelity speech synthesis is achieved. This approach establishes a text-to-speech joint optimization framework. Evaluated on the Fisher corpus, it significantly improves naturalness (MOS 4.12) and semantic fidelity (BLEU-4 increase of 23.6%) while reducing disfluency by 37%. To our knowledge, this is the first work enabling semantic-controllable and acoustically natural spoken dialogue generation driven entirely by LLMs.

Technology Category

Application Category

📝 Abstract

Recently, ``textless"speech language models (SLMs) based on speech units have made huge progress in generating naturalistic speech, including non-verbal vocalizations. However, the generated speech samples often lack semantic coherence. In this paper, we propose SLM and LLM Integration for spontaneous spoken Dialogue gEneration (SLIDE). Specifically, we first utilize an LLM to generate the textual content of spoken dialogue. Next, we convert the textual dialogues into phoneme sequences and use a two-tower transformer-based duration predictor to predict the duration of each phoneme. Finally, an SLM conditioned on the spoken phoneme sequences is used to vocalize the textual dialogue. Experimental results on the Fisher dataset demonstrate that our system can generate naturalistic spoken dialogue while maintaining high semantic coherence.

Problem

Research questions and friction points this paper is trying to address.

Natural Language Generation

Oral Dialogue

Non-verbal Sounds

Innovation

Methods, ideas, or system contributions that make the work stand out.

SLMs and LLMs Integration

Two-tower Transformer-based Duration Predictor

Semantic Coherence in Spoken Dialogue Generation

🔎 Similar Papers

A Survey on Recent Advances in LLM-Based Multi-turn Dialogue Systems