🤖 AI Summary
Most end-to-end spoken dialogue systems rely on voice activity detection (VAD) for turn-taking, which struggles to distinguish brief pauses from genuine turn completion. Full-duplex systems remove the need for VAD, but they exhibit weaker semantic reasoning than cascaded systems. To address these limitations, we propose SCoT, a framework that introduces streaming Chain-of-Thought (CoT) reasoning into full-duplex end-to-end modeling for the first time. SCoT uses frame-level alignments to construct intermediate supervision signals, and it alternates between processing fixed-duration chunks of user speech and generating the response for each chunk. This enables fine-grained temporal synchronization and semantically coherent response generation without VAD. Experiments show that SCoT produces more coherent responses than existing full-duplex approaches, while offering lower latency and more natural, overlapping interaction than turn-by-turn systems. It thereby narrows the gap between end-to-end and cascaded systems in both semantic reasoning and real-time performance.
📝 Abstract
Most end-to-end (E2E) spoken dialogue systems (SDS) rely on voice activity detection (VAD) for turn-taking, but VAD fails to distinguish between pauses and turn completions. Duplex SDS models address this by predicting output continuously, including silence tokens, thus removing the need for explicit VAD. However, they often have a complex dual-channel architecture and lag behind cascaded models in semantic reasoning. To overcome these challenges, we propose SCoT: a Streaming Chain-of-Thought (CoT) framework for Duplex SDS that alternates between processing fixed-duration user input and generating responses in a blockwise manner. Using frame-level alignments, we create intermediate targets (aligned user transcripts and system responses) for each block. Experiments show that our approach produces more coherent and interpretable responses than existing duplex methods while supporting lower-latency and overlapping interactions compared to turn-by-turn systems.
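
To make the blockwise target construction concrete, here is a minimal sketch of how per-block intermediate targets might be derived from frame-level (word-level) alignments. It is an illustration under stated assumptions, not the authors' implementation: the `AlignedWord` type, the `build_block_targets` function, the `<sil>` token string, and the rule that a word belongs to the block in which it ends are all hypothetical choices made for this example.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class AlignedWord:
    """A word with alignment timestamps in integer milliseconds (hypothetical type)."""
    text: str
    start_ms: int
    end_ms: int


def build_block_targets(
    user_words: List[AlignedWord],
    system_words: List[AlignedWord],
    block_ms: int,
    total_ms: int,
    silence_token: str = "<sil>",  # assumed token name, not taken from the paper
) -> List[Dict[str, object]]:
    """Split a dialogue into fixed-duration blocks and collect per-block targets.

    Assumption: a word is assigned to the block in which it ENDS, so a block
    only contains words that have been fully heard (or fully spoken).
    """
    n_blocks = (total_ms + block_ms - 1) // block_ms  # integer ceiling division
    blocks = []
    for b in range(n_blocks):
        lo, hi = b * block_ms, (b + 1) * block_ms
        user = [w.text for w in user_words if lo < w.end_ms <= hi]
        system = [w.text for w in system_words if lo < w.end_ms <= hi]
        blocks.append(
            {
                "block": b,
                # Intermediate target 1: transcript of the user speech in this block.
                "user_transcript": " ".join(user),
                # Intermediate target 2: system response for this block,
                # or a silence token when the system should stay quiet.
                "system_response": " ".join(system) if system else silence_token,
            }
        )
    return blocks
```

With 1-second blocks, a user utterance containing a mid-sentence pause yields silence targets for the system until the utterance is complete. This is the behaviour that lets a duplex model stay quiet through a pause without an explicit VAD decision.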