Mini-Omni-Reasoner: Token-Level Thinking-in-Speaking in Large Speech Models

📅 2025-08-18
📈 Citations: 0
Influential: 0

🤖 AI Summary
Problem: Current large speech models (LSMs) follow a “think-then-speak” paradigm: reasoning must finish before any speech is generated, which adds significant latency and hinders real-time interaction. Method: We propose Mini-Omni-Reasoner, an LSM built on a token-level “Thinking-in-Speaking” formulation that interleaves silent reasoning tokens with spoken response tokens. It is built on a hierarchical Thinker-Talker architecture and enforces local semantic alignment, so that each response token is informed by the reasoning that precedes it. Contribution/Results: Trained on the newly introduced Spoken-Math-Problems-3M dataset, Mini-Omni-Reasoner achieves gains of +19.1% in arithmetic reasoning and +6.4% in contextual understanding on the Spoken-MQA benchmark, with shorter outputs and zero decoding latency.

📝 Abstract
Reasoning is essential for effective communication and decision-making. While recent advances in LLMs and MLLMs have shown that incorporating explicit reasoning significantly improves understanding and generalization, reasoning in LSMs remains in a nascent stage. Early efforts attempt to transfer the "Thinking-before-Speaking" paradigm from textual models to speech. However, this sequential formulation introduces notable latency, as spoken responses are delayed until reasoning is fully completed, impairing real-time interaction and communication efficiency. To address this, we propose Mini-Omni-Reasoner, a framework that enables reasoning within speech via a novel "Thinking-in-Speaking" formulation. Rather than completing reasoning before producing any verbal output, Mini-Omni-Reasoner interleaves silent reasoning tokens with spoken response tokens at the token level. This design allows continuous speech generation while embedding structured internal reasoning, leveraging the model's high-frequency token processing capability. Although interleaved, local semantic alignment is enforced to ensure that each response token is informed by its preceding reasoning. To support this framework, we introduce Spoken-Math-Problems-3M, a large-scale dataset tailored for interleaved reasoning and response. The dataset ensures that verbal tokens consistently follow relevant reasoning content, enabling accurate and efficient learning of speech-coupled reasoning. Built on a hierarchical Thinker-Talker architecture, Mini-Omni-Reasoner delivers fluent yet logically grounded spoken responses, maintaining both naturalness and precision. On the Spoken-MQA benchmark, it achieves a +19.1% gain in arithmetic reasoning and +6.4% in contextual understanding, with shorter outputs and zero decoding latency.

Problem

Research questions and friction points this paper is trying to address.

Enable reasoning during speech generation rather than before it
Remove the latency of sequential "Thinking-before-Speaking", where spoken output waits until reasoning has finished
Improve the accuracy and conciseness of spoken responses
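
The latency argument can be illustrated with a back-of-envelope model. The sketch below is not taken from the paper: the function names, the chunk size, and the decoding rate are hypothetical, and it assumes each chunk of silent reasoning is generated before the spoken tokens it informs. It only counts how many tokens must be generated before the first spoken token appears under each paradigm.

```python
def tokens_before_first_speech_sequential(num_reasoning_tokens: int) -> int:
    """Think-then-speak: every reasoning token precedes the first spoken token."""
    return num_reasoning_tokens


def tokens_before_first_speech_interleaved(num_reasoning_tokens: int,
                                           reasoning_per_chunk: int) -> int:
    """Interleaved: only the first reasoning chunk precedes the first spoken token.

    reasoning_per_chunk is a hypothetical chunk size, not a value from the paper.
    """
    return min(num_reasoning_tokens, reasoning_per_chunk)


def delay_seconds(tokens_before_speech: int, tokens_per_second: float) -> float:
    """Convert a token count into a wait time at a given decoding rate."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return tokens_before_speech / tokens_per_second


if __name__ == "__main__":
    # Illustrative numbers only: 400 reasoning tokens, chunks of 8, 50 tokens/s.
    rate = 50.0
    sequential = tokens_before_first_speech_sequential(400)
    interleaved = tokens_before_first_speech_interleaved(400, 8)
    print(delay_seconds(sequential, rate), delay_seconds(interleaved, rate))
```

Under these assumed numbers the sequential wait grows with the full length of the reasoning trace, while the interleaved wait is bounded by a single chunk regardless of how long the reasoning runs.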

Innovation

Methods, ideas, or system contributions that make the work stand out.

Interleaves silent reasoning tokens with spoken response tokens at the token level
Leverages the model's high-frequency token processing to keep speech continuous while reasoning proceeds
Builds on a hierarchical Thinker-Talker architecture to deliver fluent, logically grounded spoken responses
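
The interleaving idea in the points above can be sketched in a few lines. This is a minimal illustration and not the authors' implementation: the `Token` type, the `interleave` and `spoken_stream` helpers, and the default chunk sizes are all hypothetical. It assumes a fixed number of silent reasoning tokens is emitted before each small group of spoken tokens, so that spoken tokens always follow the reasoning that informs them, and that only spoken tokens are forwarded to speech synthesis.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, List, Sequence


@dataclass(frozen=True)
class Token:
    text: str
    spoken: bool  # True: forwarded to speech synthesis; False: silent reasoning


def interleave(reasoning: Sequence[str],
               response: Sequence[str],
               reasoning_per_chunk: int = 8,
               response_per_chunk: int = 2) -> List[Token]:
    """Build one sequence that alternates reasoning chunks and response chunks.

    The chunk sizes are illustrative assumptions, not values from the paper.
    Each response chunk is placed after a reasoning chunk, a rough stand-in
    for the local semantic alignment the abstract describes.
    """
    if reasoning_per_chunk < 1 or response_per_chunk < 1:
        raise ValueError("chunk sizes must be at least 1")
    out: List[Token] = []
    r = s = 0
    while r < len(reasoning) or s < len(response):
        # Silent reasoning first, so the spoken tokens can depend on it.
        out.extend(Token(t, spoken=False) for t in reasoning[r:r + reasoning_per_chunk])
        r += reasoning_per_chunk
        out.extend(Token(t, spoken=True) for t in response[s:s + response_per_chunk])
        s += response_per_chunk
    return out


def spoken_stream(tokens: Iterable[Token]) -> Iterator[str]:
    """Yield only the tokens that would be voiced; reasoning stays silent."""
    for token in tokens:
        if token.spoken:
            yield token.text


if __name__ == "__main__":
    mixed = interleave(["3", "+", "4", "=", "7"], ["The", "answer", "is", "seven"], 2, 2)
    print([t.text for t in mixed])
    print(" ".join(spoken_stream(mixed)))
```

Filtering on the `spoken` flag loosely mirrors a Thinker-Talker split: one component produces the mixed stream, and a second component voices only the response tokens as they arrive.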