Chain-of-Thought Reasoning in Streaming Full-Duplex End-to-End Spoken Dialogue Systems

📅 2025-10-02

🤖 AI Summary

Most end-to-end spoken dialogue systems rely on voice activity detection (VAD) for turn-taking, but VAD struggles to distinguish a brief pause from a genuine end of turn. Full-duplex models avoid VAD by predicting output continuously, yet they tend to need complex dual-channel architectures and show weaker semantic reasoning than cascaded systems. To address these limitations, the paper proposes SCoT, a framework that brings streaming Chain-of-Thought (CoT) reasoning to full-duplex end-to-end modeling. SCoT uses frame-level alignment to construct intermediate supervision signals and alternates between processing fixed-duration chunks of user speech and generating the response, block by block, which keeps input and output synchronized in time without VAD. Experiments show that SCoT produces more coherent and interpretable responses than existing duplex approaches and supports lower-latency, overlapping interaction compared with turn-by-turn systems, narrowing the gap between end-to-end and cascaded systems in semantic reasoning.

📝 Abstract

Most end-to-end (E2E) spoken dialogue systems (SDS) rely on voice activity detection (VAD) for turn-taking, but VAD fails to distinguish between pauses and turn completions. Duplex SDS models address this by predicting output continuously, including silence tokens, thus removing the need for explicit VAD. However, they often have a complex dual-channel architecture and lag behind cascaded models in semantic reasoning. To overcome these challenges, we propose SCoT: a Streaming Chain-of-Thought (CoT) framework for Duplex SDS that alternates between processing fixed-duration user input and generating responses in a blockwise manner. Using frame-level alignments, we create intermediate targets (aligned user transcripts and system responses) for each block. Experiments show that our approach produces more coherent and interpretable responses than existing duplex methods while supporting lower-latency and overlapping interactions compared to turn-by-turn systems.
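
To make the idea of per-block intermediate targets concrete, here is a minimal sketch of how word-level alignments could be split into fixed-duration blocks. It is not the paper's implementation. The function name `blockwise_targets`, the `<sil>` placeholder, the 1-second block length in the usage lines, and the rule that a word belongs to the block in which it ends are all assumptions made for illustration.

```python
import math
from typing import List, Tuple

# (token, start_sec, end_sec), as a forced aligner might produce (assumed format)
Word = Tuple[str, float, float]

SIL = "<sil>"  # hypothetical placeholder for a block with no completed word


def blockwise_targets(words: List[Word], block_sec: float, total_sec: float) -> List[str]:
    """Assign each aligned word to the fixed-duration block in which it ends.

    Assumption: a word only becomes a target once it has been fully heard,
    so it is placed in the block that contains its end time.
    """
    if block_sec <= 0:
        raise ValueError("block_sec must be positive")
    n_blocks = max(1, math.ceil(total_sec / block_sec))
    blocks: List[List[str]] = [[] for _ in range(n_blocks)]
    for token, _start, end in words:
        # A word ending exactly on a boundary stays in the earlier block.
        idx = math.ceil(end / block_sec) - 1
        idx = min(max(idx, 0), n_blocks - 1)
        blocks[idx].append(token)
    return [" ".join(b) if b else SIL for b in blocks]


words = [("hello", 0.2, 0.6), ("there", 0.7, 1.3), ("friend", 1.4, 1.9)]
print(blockwise_targets(words, block_sec=1.0, total_sec=3.0))
# ['hello', 'there friend', '<sil>']
```

The same routine could be applied to the system side of the conversation to obtain the aligned response text for each block, again under the assumptions stated above.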

Problem

Research questions and friction points this paper is trying to address.

Distinguish between pauses and turn completions in dialogue systems

Simplify complex dual-channel architecture in duplex SDS models

Enhance semantic reasoning while maintaining low-latency interactions

Innovation

Methods, ideas, or system contributions that make the work stand out.

Streaming Chain-of-Thought framework for duplex dialogue systems

Blockwise processing with aligned transcripts and responses

Frame-level alignments create intermediate targets for coherence
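
The alternation between listening and responding can be pictured as a simple interleaved schedule. The sketch below is a hypothetical illustration only: the function `interleave_blocks`, the step labels `"listen"` and `"respond"`, and the `<sil>` placeholder are not taken from the paper, and the ordering within each block is an assumption.

```python
from typing import List, Tuple

Step = Tuple[str, int, str]  # (phase, block index, text target for that phase)


def interleave_blocks(user_blocks: List[str], system_blocks: List[str]) -> List[Step]:
    """Build an alternating schedule: consume user block k, then emit response block k.

    Both lists hold one text target per fixed-duration block; silent blocks
    carry a placeholder such as "<sil>" (an assumed convention).
    """
    if len(user_blocks) != len(system_blocks):
        raise ValueError("user and system must have the same number of blocks")
    schedule: List[Step] = []
    for k, (user_text, system_text) in enumerate(zip(user_blocks, system_blocks)):
        schedule.append(("listen", k, user_text))      # intermediate target: aligned user transcript
        schedule.append(("respond", k, system_text))   # intermediate target: aligned system response
    return schedule


for step in interleave_blocks(["what time", "is it"], ["<sil>", "it is noon"]):
    print(step)
```

Because the response for each block is emitted right after that block of input, the system can stay silent while the user is mid-utterance and begin speaking as soon as the content warrants it, without a separate VAD decision.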
