Mini-Omni-Reasoner: Token-Level Thinking-in-Speaking in Large Speech Models

📅 2025-08-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current large speech models (LSMs) adopt a "think-then-speak" paradigm, requiring reasoning to complete fully before speech generation begins, which introduces significant latency and hinders real-time interaction. Method: We propose Mini-Omni-Reasoner, the first LSM to introduce token-level "Thinking-in-Speaking," enabling dynamic interleaving of silent reasoning and speech output at the token level. Its hierarchical Thinker-Talker architecture incorporates local semantic alignment to support fine-grained coordination between reasoning and articulation. Contribution/Results: Trained on the newly curated Spoken-Math-Problems-3M dataset, Mini-Omni-Reasoner achieves +19.1% arithmetic reasoning accuracy and +6.4% contextual understanding on the Spoken-MQA benchmark. It generates more concise outputs and eliminates decoding latency, achieving zero-latency, structured, and logically coherent in-speech reasoning for the first time.

📝 Abstract
Reasoning is essential for effective communication and decision-making. While recent advances in large language models (LLMs) and multimodal LLMs (MLLMs) have shown that incorporating explicit reasoning significantly improves understanding and generalization, reasoning in large speech models (LSMs) remains in a nascent stage. Early efforts attempt to transfer the "Thinking-before-Speaking" paradigm from textual models to speech. However, this sequential formulation introduces notable latency, as spoken responses are delayed until reasoning is fully completed, impairing real-time interaction and communication efficiency. To address this, we propose Mini-Omni-Reasoner, a framework that enables reasoning within speech via a novel "Thinking-in-Speaking" formulation. Rather than completing reasoning before producing any verbal output, Mini-Omni-Reasoner interleaves silent reasoning tokens with spoken response tokens at the token level. This design allows continuous speech generation while embedding structured internal reasoning, leveraging the model's high-frequency token processing capability. Although the two token streams are interleaved, local semantic alignment is enforced so that each response token is informed by its preceding reasoning. To support this framework, we introduce Spoken-Math-Problems-3M, a large-scale dataset tailored for interleaved reasoning and response. The dataset ensures that verbal tokens consistently follow relevant reasoning content, enabling accurate and efficient learning of speech-coupled reasoning. Built on a hierarchical Thinker-Talker architecture, Mini-Omni-Reasoner delivers fluent yet logically grounded spoken responses, maintaining both naturalness and precision. On the Spoken-MQA benchmark, it achieves a +19.1% gain in arithmetic reasoning and +6.4% in contextual understanding, with shorter outputs and zero decoding latency.
Problem

Research questions and friction points this paper is trying to address.

Enable real-time reasoning during speech generation
Reduce latency in speech models by interleaving tokens
Improve accuracy and efficiency in spoken responses
Innovation

Methods, ideas, or system contributions that make the work stand out.

Interleaves silent reasoning tokens with spoken response tokens
Leverages token-level processing for real-time speech generation
Uses hierarchical Thinker-Talker architecture for logical responses
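The core idea of interleaving silent reasoning tokens with spoken tokens can be illustrated with a minimal sketch. Everything below (the `THINK`/`SPEAK` labels, the fixed `ratio` of reasoning tokens per spoken token, and the helper names) is an illustrative assumption for exposition, not the paper's actual API or decoding schedule:

```python
# Minimal sketch of token-level "Thinking-in-Speaking" interleaving.
# THINK/SPEAK labels and the fixed `ratio` are illustrative assumptions,
# not the paper's implementation.

THINK, SPEAK = "think", "speak"

def interleave(reasoning_tokens, response_tokens, ratio=2):
    """Emit `ratio` silent reasoning tokens before each spoken token, so
    every response token is preceded by the reasoning that informs it
    (a simplified stand-in for the paper's local semantic alignment)."""
    stream = []
    r = iter(reasoning_tokens)
    for spoken in response_tokens:
        for _ in range(ratio):
            tok = next(r, None)
            if tok is not None:
                stream.append((THINK, tok))
        stream.append((SPEAK, spoken))
    for tok in r:  # flush any leftover reasoning tokens
        stream.append((THINK, tok))
    return stream

def spoken_only(stream):
    """The Talker vocalizes only spoken tokens; reasoning stays silent."""
    return [tok for kind, tok in stream if kind == SPEAK]
```

Because spoken tokens are emitted throughout decoding rather than after a complete reasoning pass, speech can begin immediately, which is the source of the zero-decoding-latency claim.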
👥 Authors
Zhifei Xie, Tsinghua University
Ziyang Ma, Nanyang Technological University
Zihang Liu, Beijing Institute of Technology
Kaiyu Pang, Beijing Institute of Technology
Hongyu Li, Beihang University
Jialin Zhang, National University of Singapore
Yue Liao, National University of Singapore
Deheng Ye, Director of AI, Tencent
Chunyan Miao, Nanyang Technological University
Shuicheng Yan, National University of Singapore