Chronological Thinking in Full-Duplex Spoken Dialogue Language Models

📅 2025-10-02

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Current full-duplex spoken dialogue models predict only silence tokens during the listening phase and so lack human-like real-time reasoning, which increases response latency and degrades output quality. To address this, we propose a strictly causal temporal reasoning paradigm that adds no latency: inference proceeds incrementally in step with the streaming speech input, updating hidden states without lookahead. Our approach combines streaming speech processing, a full-duplex architecture, and an online hidden-state update mechanism, so a high-quality response can be generated as soon as the user pauses. Experiments show significant improvements over baselines in both objective metrics (e.g., latency, BLEU, ROUGE) and human evaluations, achieving state-of-the-art performance in response timeliness, coherence, and dynamic adaptability.

📝 Abstract

Recent advances in spoken dialogue language models (SDLMs) reflect growing interest in shifting from turn-based to full-duplex systems, where the models continuously perceive user speech streams while generating responses. This simultaneous listening and speaking design enables real-time interaction and lets the agent handle dynamic conversational behaviors like user barge-in. However, during the listening phase, existing systems keep the agent idle by repeatedly predicting the silence token, which departs from human behavior: we usually engage in lightweight thinking during conversation rather than remaining absent-minded. Inspired by this, we propose Chronological Thinking, an on-the-fly conversational thinking mechanism that aims to improve response quality in full-duplex SDLMs. Specifically, chronological thinking is a paradigm shift from conventional LLM thinking approaches such as Chain-of-Thought: it is purpose-built for streaming acoustic input and has two defining properties. (1) Strictly causal: the agent reasons incrementally while listening, updating internal hypotheses only from past audio with no lookahead. (2) No additional latency: reasoning is amortized during the listening window; once the user stops speaking, the agent halts thinking and begins speaking without further delay. Experiments demonstrate the effectiveness of chronological thinking: both objective metrics and human evaluations show consistent improvements in response quality. Furthermore, chronological thinking robustly handles conversational dynamics and attains competitive performance on full-duplex interaction metrics.
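
The two properties in the abstract can be illustrated with a toy control loop. This is only a minimal sketch under stated assumptions, not the paper's model: the names `ChronologicalAgent`, `think`, `respond`, `run_turn`, and the `USER_PAUSE` marker are hypothetical, text strings stand in for audio chunks, and a list of notes stands in for the internal hypotheses. It shows the shape of the idea: one thinking step per incoming chunk using past input only, and a response produced from the accumulated notes as soon as the user stops.

```python
from dataclasses import dataclass, field
from typing import Iterable, List, Tuple

# Hypothetical marker standing in for end-of-user-speech detection.
USER_PAUSE = "<pause>"


@dataclass
class ChronologicalAgent:
    """Toy agent with hypothetical names; not the paper's architecture."""

    # Running hypotheses accumulated while listening.
    notes: List[str] = field(default_factory=list)

    def think(self, chunk: str) -> str:
        # Strictly causal: uses only the current chunk and earlier notes,
        # never any chunk that has not arrived yet.
        note = f"heard:{chunk}"
        self.notes.append(note)
        return note

    def respond(self) -> str:
        # No extra thinking pass here: the reply reuses notes that were
        # already computed during the listening window.
        return "reply based on " + ", ".join(self.notes)


def run_turn(agent: ChronologicalAgent, chunks: Iterable[str]) -> Tuple[List[str], str]:
    """Consume chunks one at a time and return (thinking trace, response)."""
    trace: List[str] = []
    for chunk in chunks:
        if chunk == USER_PAUSE:
            # User stopped speaking: halt thinking and speak immediately.
            break
        # Instead of emitting a silence token, do one thinking step.
        trace.append(agent.think(chunk))
    return trace, agent.respond()


if __name__ == "__main__":
    trace, reply = run_turn(ChronologicalAgent(), ["book", "a", "table", USER_PAUSE])
    print(trace)
    print(reply)
```

In this sketch the thinking cost is paid once per chunk while the user is still talking, which is the sense in which the reasoning is amortized over the listening window. A real system would replace the string notes with model hidden states or thinking tokens.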

Problem

Research questions and friction points this paper is trying to address.

Improves response quality in full-duplex spoken dialogue systems

Enables incremental reasoning during listening without latency

Handles conversational dynamics like user barge-in robustly

Innovation

Methods, ideas, or system contributions that make the work stand out.

Strictly causal reasoning without lookahead during listening

Amortized thinking during listening adds no latency
