DiffuSpeech: Silent Thought, Spoken Answer via Unified Speech-Text Diffusion

📅 2026-01-30

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the lack of explicit reasoning in existing speech language models, which generate responses directly and cannot correct errors once audio has been produced. The authors propose a “think silently, speak aloud” paradigm and introduce the first diffusion-based speech-text language model that reasons explicitly in spoken question answering, along with the first speech QA dataset paired with textual reasoning traces. The approach uses a masked diffusion framework with modality-specific masking schedules to jointly model discrete text and tokenized speech, so that reasoning traces and spoken responses are generated together through iterative denoising. Experiments show that the model outperforms the best baseline by up to 9 points in speech-to-speech QA accuracy, attains the best TTS quality among generative models (6.2% WER), and preserves language understanding (66.2% MMLU).
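The TTS figure above is a word error rate (WER): the word-level edit distance between a reference transcript and the transcript recognized from the generated speech, divided by the number of reference words. The sketch below shows only that standard definition; the function name `wer` and the whitespace tokenization are illustrative assumptions, and it says nothing about the ASR system or text normalization the authors used.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / number of reference words.

    Assumes a non-empty reference and plain whitespace tokenization
    (an illustrative choice, not the paper's evaluation pipeline).
    """
    ref = reference.split()
    hyp = hypothesis.split()
    # prev[j] holds the edit distance between the reference prefix seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]  # deleting all i reference words
        for j, h in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,             # deletion
                cur[j - 1] + 1,          # insertion
                prev[j - 1] + (r != h),  # substitution or match
            ))
        prev = cur
    return prev[-1] / len(ref)
```

Under this definition, a 6.2% WER corresponds to roughly 6 word errors per 100 reference words.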

📝 Abstract

Current speech language models generate responses directly without explicit reasoning, leading to errors that cannot be corrected once audio is produced. We introduce "Silent Thought, Spoken Answer": a paradigm where speech LLMs generate internal text reasoning alongside spoken responses, with thinking traces informing speech quality. To realize this, we present DiffuSpeech, the first diffusion-based speech-text language model supporting both understanding and generation, unifying discrete text and tokenized speech under a single masked diffusion framework. Unlike autoregressive approaches, DiffuSpeech jointly generates reasoning traces and speech tokens through iterative denoising, with modality-specific masking schedules. We also construct the first speech QA dataset with paired text reasoning traces, containing 26K samples totaling 319 hours. Experiments show DiffuSpeech achieves state-of-the-art speech-to-speech QA accuracy, outperforming the best baseline by up to 9 points, while attaining the best TTS quality among generative models (6.2% WER) and preserving language understanding (66.2% MMLU). Ablations confirm that both the diffusion architecture and thinking traces contribute to these gains.
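To make "modality-specific masking schedules" concrete, here is a minimal sketch of the corruption step in a masked diffusion model over a mixed text and speech token sequence. The abstract does not specify the schedules, so everything here is an assumption for illustration: the linear schedule for text, the quadratic schedule for speech, the `[MASK]` symbol, and the names `mask_prob` and `corrupt` are hypothetical and are not the authors' implementation.

```python
import random

MASK = "[MASK]"  # hypothetical mask symbol shared by both modalities


def mask_prob(t: float, modality: str) -> float:
    """Probability of masking a token at noise level t in [0, 1].

    Assumed schedules (not from the paper): linear for text,
    quadratic for speech, so speech is masked more slowly at small t.
    """
    if modality == "text":
        return t
    return t ** 2


def corrupt(tokens, modalities, t, rng):
    """Independently replace each token with MASK at its modality's rate."""
    return [
        MASK if rng.random() < mask_prob(t, mod) else tok
        for tok, mod in zip(tokens, modalities)
    ]


# Usage: a reasoning trace (text) followed by speech codec tokens.
tokens = ["2", "+", "2", "=", "4", "<s17>", "<s03>", "<s88>"]
modalities = ["text"] * 5 + ["speech"] * 3
noisy = corrupt(tokens, modalities, t=0.5, rng=random.Random(0))
```

At `t = 0` nothing is masked and at `t = 1` every token is masked. In between, the two modalities are corrupted at different rates, and a denoiser trained on such inputs can recover text and speech tokens jointly over several iterative steps.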

Problem

Research questions and friction points this paper is trying to address.

speech language models

explicit reasoning

speech-to-speech QA

error correction

reasoning traces

Innovation

Methods, ideas, or system contributions that make the work stand out.

speech-text diffusion

reasoning traces

masked diffusion framework