SLIDE: Integrating Speech Language Model with LLM for Spontaneous Spoken Dialogue Generation

📅 2025-01-01
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address poor semantic coherence and inadequate nonverbal acoustic modeling in spontaneous spoken dialogue generation, this paper proposes an end-to-end framework synergizing large language models (LLMs) and speech-language models (SLMs). First, an LLM generates semantically consistent textual dialogues; then, phoneme sequence modeling and a novel two-tower Transformer-based duration predictor—introducing semantic-guided joint phoneme duration modeling—are employed; finally, high-fidelity speech synthesis is achieved. This approach establishes a text-to-speech joint optimization framework. Evaluated on the Fisher corpus, it significantly improves naturalness (MOS 4.12) and semantic fidelity (BLEU-4 increase of 23.6%) while reducing disfluency by 37%. To our knowledge, this is the first work enabling semantic-controllable and acoustically natural spoken dialogue generation driven entirely by LLMs.

Technology Category

Application Category

📝 Abstract
Recently, ``textless"speech language models (SLMs) based on speech units have made huge progress in generating naturalistic speech, including non-verbal vocalizations. However, the generated speech samples often lack semantic coherence. In this paper, we propose SLM and LLM Integration for spontaneous spoken Dialogue gEneration (SLIDE). Specifically, we first utilize an LLM to generate the textual content of spoken dialogue. Next, we convert the textual dialogues into phoneme sequences and use a two-tower transformer-based duration predictor to predict the duration of each phoneme. Finally, an SLM conditioned on the spoken phoneme sequences is used to vocalize the textual dialogue. Experimental results on the Fisher dataset demonstrate that our system can generate naturalistic spoken dialogue while maintaining high semantic coherence.
Problem

Research questions and friction points this paper is trying to address.

Natural Language Generation
Oral Dialogue
Non-verbal Sounds
Innovation

Methods, ideas, or system contributions that make the work stand out.

SLMs and LLMs Integration
Two-tower Transformer-based Duration Predictor
Semantic Coherence in Spoken Dialogue Generation
🔎 Similar Papers
No similar papers found.
H
Haitian Lu
Key Laboratory of Speech Acoustics and Content Understanding, Institute of Acoustics, CAS, China; University of Chinese Academy of Sciences, China
Gaofeng Cheng
Gaofeng Cheng
Institute of Acoustics, Chinese Academy of Sciences
Speech ProcessingAcousticsRepresentation LearningMachine Learning
L
Liuping Luo
Guangdong Provincial Public Security Department, Guangzhou, China
L
Leying Zhang
AudioCC Lab, MoE Key Lab of AI, CS Dept, Shanghai Jiao Tong University, Shanghai, China
Yanmin Qian
Yanmin Qian
Professor, Shanghai Jiao Tong University
Speech and Language ProcessingSignal ProcessingMachine Learning
P
Pengyuan Zhang
Key Laboratory of Speech Acoustics and Content Understanding, Institute of Acoustics, CAS, China; University of Chinese Academy of Sciences, China