STITCH: Simultaneous Thinking and Talking with Chunked Reasoning for Spoken Language Models

📅 2025-07-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current spoken language models (SLMs) lack human-like “silent thinking” capabilities, struggling to balance reasoning depth with real-time responsiveness. To address this, we propose Chunked Reasoning—a mechanism that interleaves inference chunks and speech output chunks during natural pauses in audio playback, enabling concurrent reasoning and spoken response generation. This approach integrates chain-of-thought (CoT) reasoning with real-time speech scheduling, achieving complex inference without introducing additional latency. On mathematical reasoning benchmarks, our method improves accuracy by 15% over non-reasoning baselines, while maintaining baseline performance on non-reasoning tasks and incurring no statistically significant increase in end-to-end latency. To the best of our knowledge, this is the first work to realize low-latency, high-fidelity implicit reasoning tightly coupled with speech generation in SLMs. Our framework establishes a new paradigm for embodied intelligence and real-time human–machine dialogue systems.

📝 Abstract
Spoken Language Models (SLMs) are designed to take speech inputs and produce spoken responses. However, current SLMs lack the ability to perform an internal, unspoken thinking process before responding. In contrast, humans typically engage in complex mental reasoning internally, enabling them to communicate ideas clearly and concisely. Thus, integrating an unspoken thought process into SLMs is highly desirable. While naively generating a complete chain-of-thought (CoT) reasoning before starting to talk can enable thinking for SLMs, this induces additional latency for the speech response, as the CoT reasoning can be arbitrarily long. To solve this issue, we propose Stitch, a novel generation method that alternates between the generation of unspoken reasoning chunks and spoken response chunks. Since the audio duration of a chunk of spoken response is much longer than the time to generate the tokens in a chunk of spoken response, we use the remaining free time to generate the unspoken reasoning tokens. When a chunk of audio is played to the user, the model continues to generate the next unspoken reasoning chunk, achieving simultaneous thinking and talking. Remarkably, Stitch matches the latency of baselines that cannot generate unspoken CoT by design while outperforming those baselines by 15% on math reasoning datasets; Stitch also performs equally well on non-reasoning datasets as those baseline models. Some animations and demonstrations are on the project page: https://d223302.github.io/STITCH.
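The timing argument in the abstract can be made concrete with a small scheduling sketch: because playing back one spoken token takes far longer than generating it, each speech chunk leaves a time budget in which unspoken reasoning tokens can be produced for free. The sketch below is illustrative only, not the authors' implementation; all constants (per-token generation cost, audio duration per spoken token, chunk size) and the function name `schedule` are assumptions.

```python
# Minimal sketch of a STITCH-style thinking-while-talking schedule.
# NOT the paper's code: GEN_MS, PLAY_MS, SPEECH_CHUNK, and `schedule`
# are hypothetical values/names chosen for illustration.

GEN_MS = 10        # assumed time to generate one token (ms)
PLAY_MS = 80       # assumed audio playback duration of one spoken token (ms)
SPEECH_CHUNK = 25  # assumed number of spoken tokens per response chunk

def schedule(num_chunks):
    """Interleave spoken-response chunks with unspoken reasoning chunks.

    While one speech chunk plays back (SPEECH_CHUNK * PLAY_MS ms), the
    model only needs SPEECH_CHUNK * GEN_MS ms to generate the next speech
    chunk; the leftover time is spent on unspoken reasoning tokens, so
    the reasoning adds no playback latency.
    """
    free_ms = SPEECH_CHUNK * (PLAY_MS - GEN_MS)
    reasoning_tokens = free_ms // GEN_MS  # reasoning tokens that fit per gap
    plan = []
    for _ in range(num_chunks):
        plan.append(("speak", SPEECH_CHUNK))   # tokens played to the user
        plan.append(("think", reasoning_tokens))  # hidden CoT tokens
    return plan

print(schedule(2))
```

With these assumed numbers, each 25-token speech chunk buys time for 175 hidden reasoning tokens, which is the sense in which thinking and talking proceed simultaneously.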
Problem

Research questions and friction points this paper is trying to address.

SLMs lack internal unspoken thinking before responding
Generating full chain-of-thought increases speech latency
Need simultaneous thinking and talking in SLMs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Alternates between generating unspoken reasoning chunks and spoken response chunks
Uses audio playback time to generate reasoning tokens
Matches baseline latency while adding CoT reasoning