Chronological Thinking in Full-Duplex Spoken Dialogue Language Models

📅 2025-10-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current full-duplex spoken dialogue models predict only silence tokens during the listening phase and therefore lack human-like real-time reasoning, which leads to increased response latency and degraded output quality. To address this, we propose a strictly causal temporal reasoning paradigm that adds no latency: the model reasons incrementally in sync with the streaming speech input and updates its hidden states without lookahead. Our approach combines streaming speech processing, a full-duplex architecture, and an online hidden-state update mechanism, so that a high-quality response can be generated as soon as the user pauses. Experiments demonstrate significant improvements over baselines in both objective metrics (e.g., latency, BLEU, ROUGE) and human evaluations, achieving state-of-the-art performance in response timeliness, coherence, and dynamic adaptability.

📝 Abstract
Recent advances in spoken dialogue language models (SDLMs) reflect growing interest in shifting from turn-based to full-duplex systems, in which the model continuously perceives the user's speech stream while generating responses. This simultaneous listening-and-speaking design enables real-time interaction and lets the agent handle dynamic conversational behaviors such as user barge-in. However, during the listening phase, existing systems keep the agent idle by repeatedly predicting the silence token, which departs from human behavior: people usually engage in lightweight thinking during conversation rather than remaining absent-minded. Inspired by this, we propose Chronological Thinking, an on-the-fly conversational thinking mechanism that aims to improve response quality in full-duplex SDLMs. Chronological thinking represents a paradigm shift from conventional LLM thinking approaches, such as Chain-of-Thought, and is purpose-built for streaming acoustic input. It has two defining properties. (1) Strictly causal: the agent reasons incrementally while listening, updating internal hypotheses only from past audio with no lookahead. (2) No additional latency: reasoning is amortized during the listening window; once the user stops speaking, the agent halts thinking and begins speaking without further delay. Experiments with both objective metrics and human evaluations show consistent improvements in response quality. Furthermore, chronological thinking robustly handles conversational dynamics and attains competitive performance on full-duplex interaction metrics.
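The two properties described in the abstract can be illustrated with a toy loop. This is only a minimal sketch under stated assumptions, not the paper's model or training setup: the names `ListeningAgent`, `think`, `respond`, and `run_dialogue` are hypothetical, text strings stand in for audio chunks, and `None` is an assumed end-of-speech marker. It shows an agent that emits a thinking step per incoming chunk (instead of a silence token), sees only past input, and starts speaking as soon as the user stops.

```python
from dataclasses import dataclass, field
from typing import Iterable, List, Optional, Tuple


@dataclass
class ListeningAgent:
    """Toy stand-in for a full-duplex agent (hypothetical, for illustration)."""
    notes: List[str] = field(default_factory=list)

    def think(self, chunk: str) -> str:
        # Strictly causal: the agent is handed one chunk at a time, so a
        # thinking step can depend only on this chunk and earlier notes.
        note = f"heard[{len(self.notes) + 1}]: {chunk}"
        self.notes.append(note)
        return note

    def respond(self) -> str:
        # No extra reasoning pass at response time: the reply is built
        # from notes that were accumulated while listening.
        return f"reply based on {len(self.notes)} thinking steps"


def run_dialogue(
    chunks: Iterable[Optional[str]], agent: ListeningAgent
) -> List[Tuple[str, str]]:
    """Interleave listening and thinking; speak when the user stops."""
    trace: List[Tuple[str, str]] = []
    for chunk in chunks:
        if chunk is None:  # assumed end-of-speech marker
            trace.append(("speak", agent.respond()))
            break
        trace.append(("think", agent.think(chunk)))
    return trace


if __name__ == "__main__":
    for kind, text in run_dialogue(
        ["book", "a table", "for two", None], ListeningAgent()
    ):
        print(kind, "->", text)
```

In this sketch the thinking work is spread over the listening window, so the only work left when the user stops is producing the reply. A real system would replace the string notes with model hidden states or generated thinking tokens.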
Problem

Research questions and friction points this paper is trying to address.

Improves response quality in full-duplex spoken dialogue systems
Enables incremental reasoning during listening without latency
Handles conversational dynamics like user barge-in robustly
Innovation

Methods, ideas, or system contributions that make the work stand out.

Strictly causal reasoning without lookahead during listening
Amortized thinking during listening adds no latency
Authors

Donghang Wu
Nanyang Technological University
Haoyang Zhang
Ph.D. student of Computer Science, University of Illinois Urbana-Champaign
Computer Architecture, System Software
Chen Chen
Nanyang Technological University
Tianyu Zhang
Mila
Fei Tian
StepFun
Xuerui Yang
StepFun
Gang Yu
StepFun
Hexin Liu
Nanyang Technological University
Speech recognition, language identification
Nana Hou
ZOOM | Ph.D. at Nanyang Technological University, Singapore
Speech, Deep Learning
Yuchen Hu
Nanyang Technological University
Eng Siong Chng
Nanyang Technological University