Think Before You Talk: Enhancing Meaningful Dialogue Generation in Full-Duplex Speech Language Models with Planning-Inspired Text Guidance

📅 2025-08-10

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Full-duplex speech language models (FD-SLMs) face two key challenges. First, their semantic quality degrades because speech sequences are long and high-quality dual-channel conversational data is scarce. Second, text-guided speech generation suffers from poor temporal alignment and length mismatch, which impair natural interaction. To address these, we propose TurnGuide, a method inspired by human pre-utterance planning that dynamically segments assistant speech into dialogue turns and generates turn-level text guidance before each turn's speech output, which handles both the insertion timing and the length of the guidance. TurnGuide is applied to end-to-end FD-SLMs, which are trained on dual-channel conversational data and model overlapping speech, interruptions, and backchannels. Experiments show significant improvements in the semantic coherence and meaningfulness of generated speech while preserving natural conversational flow. TurnGuide thus offers a planning-inspired guidance approach for real-time spoken interaction.

📝 Abstract

Full-Duplex Speech Language Models (FD-SLMs) are specialized foundation models designed to enable natural, real-time spoken interactions by modeling complex conversational dynamics such as interruptions, backchannels, and overlapping speech. End-to-end (e2e) FD-SLMs leverage real-world double-channel conversational data to capture nuanced two-speaker dialogue patterns for human-like interactions. However, they face a critical challenge: their conversational abilities often degrade compared to pure-text conversation due to prolonged speech sequences and limited high-quality spoken dialogue data. While text-guided speech generation could mitigate these issues, it suffers from timing and length issues when integrating textual guidance into double-channel audio streams, disrupting the precise time alignment essential for natural interactions. To address these challenges, we propose TurnGuide, a novel planning-inspired approach that mimics human conversational planning by dynamically segmenting assistant speech into dialogue turns and generating turn-level text guidance before speech output, which effectively resolves both insertion timing and length challenges. Extensive experiments demonstrate that our approach significantly improves e2e FD-SLMs' conversational abilities, enabling them to generate semantically meaningful and coherent speech while maintaining natural conversational flow. Demos are available at https://dreamtheater123.github.io/TurnGuide-Demo/. Code will be available at https://github.com/dreamtheater123/TurnGuide.
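
To make the idea of segmenting assistant speech into dialogue turns more concrete, here is a minimal Python sketch. It is not the paper's algorithm: the abstract does not say how turns are detected, so the silence-gap rule, the `max_gap` threshold, and the `Word`, `Turn`, and `segment_turns` names are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Word:
    text: str
    start: float  # start time in seconds
    end: float    # end time in seconds


@dataclass
class Turn:
    start: float
    end: float
    text: str


def _close_turn(words: List[Word]) -> Turn:
    # Collapse a run of consecutive words into one turn.
    return Turn(words[0].start, words[-1].end, " ".join(w.text for w in words))


def segment_turns(words: List[Word], max_gap: float = 0.5) -> List[Turn]:
    """Group time-stamped assistant words into turns.

    ASSUMPTION: a new turn starts whenever the silence between two
    consecutive words exceeds `max_gap` seconds. This rule is hypothetical
    and only illustrates the general idea of turn segmentation.
    """
    turns: List[Turn] = []
    current: List[Word] = []
    for word in words:
        if current and word.start - current[-1].end > max_gap:
            turns.append(_close_turn(current))
            current = []
        current.append(word)
    if current:
        turns.append(_close_turn(current))
    return turns
```

With word-level timestamps from a forced aligner or ASR system (also an assumption), each resulting `Turn` carries the text that could serve as guidance for that stretch of speech.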

Problem

Research questions and friction points this paper is trying to address.

FD-SLMs degrade in conversational quality with prolonged speech sequences

Text-guided speech generation disrupts timing in double-channel audio

Limited high-quality spoken dialogue data hinders human-like interactions

Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic turn segmentation for dialogue planning

Turn-level text guidance before speech output

Resolves timing and length alignment challenges
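
The point about placing turn-level text guidance before speech output can be pictured with a toy sequence builder. This is a hedged sketch only: the token layout, the `("text", ...)` and `("speech", ...)` tags, and the `place_text_before_speech` function are hypothetical and are not taken from the paper or its code.

```python
from typing import Dict, List, Tuple

Token = Tuple[str, object]


def place_text_before_speech(num_frames: int, turn_text: Dict[int, str]) -> List[Token]:
    """Build a toy assistant-channel sequence.

    `turn_text` maps the frame index where a turn starts to that turn's text.
    ASSUMPTION: guidance is inserted as whitespace-split text tokens right
    before the first speech frame of the turn, so the amount of guidance
    follows the turn's own length instead of a fixed-size slot.
    """
    sequence: List[Token] = []
    for frame in range(num_frames):
        if frame in turn_text:
            # Emit the whole turn's text plan before any of its speech.
            for tok in turn_text[frame].split():
                sequence.append(("text", tok))
        sequence.append(("speech", frame))
    return sequence


if __name__ == "__main__":
    for token in place_text_before_speech(4, {1: "hi there"}):
        print(token)
```

In this toy layout, frames outside any turn carry no text, which is one way to think about why turn-level insertion avoids both mistimed and over-long guidance.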
