AI Summary
This work addresses the limited explicit reasoning capability of existing speech language models, which cannot correct errors once speech has been generated. The authors propose a "think silently, speak aloud" paradigm and introduce the first diffusion-based speech-text language model capable of explicit reasoning in spoken question answering, along with the first speech QA dataset annotated with textual reasoning traces. Their approach employs a masked diffusion architecture with modality-specific masking schedules to jointly model discrete text and tokenized speech, so that reasoning traces and spoken responses are generated together. Experiments show that the model outperforms the strongest baseline by up to 9 points in speech-to-speech QA accuracy, achieves the best TTS quality among generative models (6.2% WER), and retains strong language understanding performance (66.2% MMLU).
Abstract
Current speech language models generate responses directly without explicit reasoning, leading to errors that cannot be corrected once audio is produced. We introduce \textbf{``Silent Thought, Spoken Answer''} -- a paradigm where speech LLMs generate internal text reasoning alongside spoken responses, with thinking traces informing speech quality. To realize this, we present \method{}, the first diffusion-based speech-text language model supporting both understanding and generation, unifying discrete text and tokenized speech under a single masked diffusion framework. Unlike autoregressive approaches, \method{} jointly generates reasoning traces and speech tokens through iterative denoising, with modality-specific masking schedules. We also construct \dataset{}, the first speech QA dataset with paired text reasoning traces, containing 26K samples totaling 319 hours. Experiments show \method{} achieves state-of-the-art speech-to-speech QA accuracy, outperforming the best baseline by up to 9 points, while attaining the best TTS quality among generative models (6.2\% WER) and preserving language understanding (66.2\% MMLU). Ablations confirm that both the diffusion architecture and thinking traces contribute to these gains.
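As a rough illustration of what "modality-specific masking schedules" could mean in a masked diffusion setup, the sketch below applies a different masking ratio to text tokens and speech tokens at the same noise level `t`. The abstract does not give the actual schedules, so everything here is an assumption made for illustration: the `<mask>` token, the linear schedule for text, the square-root schedule for speech, and the function names `mask_ratio` and `corrupt`. It covers only the forward (corruption) step, not the iterative denoising.

```python
import math
import random

MASK = "<mask>"  # hypothetical mask token, not taken from the paper


def mask_ratio(t: float, modality: str) -> float:
    """Fraction of tokens masked at noise level t in [0, 1].

    Both schedules are assumptions for illustration only.
    """
    if modality == "text":
        return t  # assumed linear schedule for text tokens
    if modality == "speech":
        return math.sqrt(t)  # assumed: speech is masked more heavily at small t
    raise ValueError(f"unknown modality: {modality}")


def corrupt(tokens, modalities, t, rng):
    """Forward masking: mask each token independently using its modality's ratio."""
    return [
        MASK if rng.random() < mask_ratio(t, m) else tok
        for tok, m in zip(tokens, modalities)
    ]


if __name__ == "__main__":
    rng = random.Random(0)
    # Toy sequence: four text (reasoning) tokens followed by four speech-codec tokens.
    tokens = ["the", "answer", "is", "4", "s17", "s203", "s5", "s88"]
    modalities = ["text"] * 4 + ["speech"] * 4
    print(corrupt(tokens, modalities, 0.3, rng))
```

Under these assumed schedules, at `t = 0` nothing is masked and at `t = 1` every token is masked. For intermediate values the speech tokens are masked more often than the text tokens, which is one way a schedule could bias training toward settings where reasoning text is already visible while speech is still being recovered.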