DiffuSpeech: Silent Thought, Spoken Answer via Unified Speech-Text Diffusion

πŸ“… 2026-01-30
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
This work addresses the limited explicit reasoning capability of existing speech language models, which cannot correct errors once speech has been generated. The authors propose a β€œthink silently, speak aloud” paradigm and introduce the first diffusion-based speech-text language model capable of explicit reasoning in spoken question answering, along with the first speech QA dataset annotated with textual reasoning traces. Their approach employs a masked diffusion architecture with modality-specific masking schedules to jointly model discrete text and tokenized speech, enabling synchronized generation of reasoning trajectories and spoken responses. Experiments show that the model outperforms the strongest baseline by up to 9 percentage points in QA accuracy, achieves the best TTS quality among generative models (6.2% WER), and retains strong language understanding performance (66.2% MMLU).

πŸ“ Abstract
Current speech language models generate responses directly without explicit reasoning, leading to errors that cannot be corrected once audio is produced. We introduce β€œSilent Thought, Spoken Answer”, a paradigm in which speech LLMs generate internal text reasoning alongside spoken responses, with the thinking traces informing speech quality. To realize this, we present DiffuSpeech, the first diffusion-based speech-text language model supporting both understanding and generation, unifying discrete text and tokenized speech under a single masked diffusion framework. Unlike autoregressive approaches, DiffuSpeech jointly generates reasoning traces and speech tokens through iterative denoising with modality-specific masking schedules. We also construct the first speech QA dataset with paired text reasoning traces, containing 26K samples totaling 319 hours. Experiments show DiffuSpeech achieves state-of-the-art speech-to-speech QA accuracy, outperforming the best baseline by up to 9 points, while attaining the best TTS quality among generative models (6.2% WER) and preserving language understanding (66.2% MMLU). Ablations confirm that both the diffusion architecture and the thinking traces contribute to these gains.
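The abstract's key mechanism, joint iterative denoising of text and speech tokens under modality-specific masking schedules, can be sketched as a toy loop. The schedule shapes (sine-style for text so reasoning unmasks early, linear for speech), the `MASK` sentinel, and the `predict` stand-in are illustrative assumptions, not the paper's actual implementation.

```python
import math
import random

MASK = -1  # sentinel for a masked token position

def mask_fraction(t, T, modality):
    """Hypothetical modality-specific schedules: text unmasks earlier
    than speech, so reasoning can inform the spoken answer."""
    r = t / T
    if modality == "text":
        return math.sin(0.5 * math.pi * r)  # decays fast as t -> 0
    return r                                # speech unmasks linearly

def denoise_step(tokens, spans, t, T, predict):
    """Reveal tokens so each modality matches its target mask
    fraction at step t; `predict` stands in for the denoiser."""
    out = list(tokens)
    for modality, (lo, hi) in spans.items():
        length = hi - lo
        target = round(mask_fraction(t, T, modality) * length)
        masked = [i for i in range(lo, hi) if out[i] == MASK]
        # unmask the surplus positions for this modality
        for i in random.sample(masked, max(0, len(masked) - target)):
            out[i] = predict(i)
    return out

# toy joint sequence: 6 text positions followed by 8 speech positions
spans = {"text": (0, 6), "speech": (6, 14)}
T = 10
tokens = [MASK] * 14
predict = lambda i: i  # dummy denoiser: fills position i with value i

for t in range(T - 1, -1, -1):  # iterate t = T-1 ... 0
    tokens = denoise_step(tokens, spans, t, T, predict)

assert MASK not in tokens  # fully denoised once t reaches 0
```

Because the text schedule drops toward zero faster than the linear speech schedule, text positions are committed in earlier denoising steps, mirroring the β€œthink silently, speak aloud” ordering described above.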
Problem

Research questions and friction points this paper is trying to address.

speech language models
explicit reasoning
speech-to-speech QA
error correction
reasoning traces
Innovation

Methods, ideas, or system contributions that make the work stand out.

speech-text diffusion
reasoning traces
masked diffusion framework
speech-to-speech QA
non-autoregressive generation
Yuxuan Lou
National University of Singapore
Ziming Wu
Hong Kong University of Science and Technology
Yaochen Wang
National University of Singapore
Yong Liu
National University of Singapore
Machine Learning · Reinforcement Learning
Yingxuan Ren
National University of Singapore
Fuming Lai
Tencent
Shaobing Lian
Tencent
Jie Tang
UW Madison
Computed Tomography
Yang You
Postdoc, Stanford University
3D vision · computer graphics · computational geometry