🤖 AI Summary
This work addresses the significant challenge of simultaneously achieving deep reasoning and fluent expression in real-time spoken language generation. The authors propose InterRS, a novel approach that interleaves reasoning steps within natural speech pauses to enable “thinking while speaking” in a real-time setting. Key innovations include a pioneering data generation pipeline that produces thought-speech interleaved sequences with controllable length ratios, an interleaved supervised fine-tuning strategy, and a dual-reward reinforcement learning mechanism combining TA-Balance and Linguistic Quality metrics. Experimental results demonstrate that InterRS improves performance by 13% on mathematical and logical reasoning benchmarks while substantially enhancing the naturalness of synthesized speech and the coherence of spoken chain-of-thought reasoning.
📝 Abstract
The thinking-while-speaking paradigm aims to make AI communication more human. A key challenge is maintaining fluent speech while performing deep reasoning. Our method, InterRS, tackles this by inserting reasoning steps only during natural speech generation. This requires high-quality data in which reasoning and speech are precisely aligned and the length ratio is controlled. We introduce a novel pipeline to generate such seamlessly interleaved audio data. To train our model, we combine interleaved SFT on refined data with reinforcement learning using two new rewards: a TA-Balance Reward to manage timing and the thinking-answer ratio, and a Linguistic Quality Reward to refine expression. Experiments show that our approach achieves 13% better performance on mathematical and logic benchmarks while responding instantly, like a spoken-language instruct model that outputs a fast CoT response. Furthermore, our method generates more natural and fluent answers than prior methods.
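
To make the idea of a controlled thinking-answer ratio concrete, the sketch below computes that ratio for a toy interleaved sequence and scores how close it is to a target. It is only an illustration: the abstract does not give the form of the TA-Balance Reward, so the `Segment` type, the function names, the linear penalty, and the `target_ratio` and `tolerance` parameters are all assumptions and not the authors' implementation.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Segment:
    """One chunk of an interleaved sequence (hypothetical representation)."""
    kind: str          # "think" for hidden reasoning, "speak" for spoken output
    tokens: List[str]


def thinking_answer_ratio(segments: List[Segment]) -> float:
    """Return thinking tokens divided by spoken tokens."""
    think = sum(len(s.tokens) for s in segments if s.kind == "think")
    speak = sum(len(s.tokens) for s in segments if s.kind == "speak")
    if speak == 0:
        return float("inf")  # nothing was spoken, so the ratio is unbounded
    return think / speak


def toy_balance_reward(
    segments: List[Segment],
    target_ratio: float = 1.0,   # assumed target, not taken from the paper
    tolerance: float = 1.0,      # assumed width of the linear penalty
) -> float:
    """Toy reward in [0, 1]: 1 at the target ratio, falling linearly to 0."""
    ratio = thinking_answer_ratio(segments)
    if ratio == float("inf"):
        return 0.0
    deviation = abs(ratio - target_ratio)
    return max(0.0, 1.0 - deviation / tolerance)


# Usage: reasoning and speech alternate, with equal token counts overall.
example = [
    Segment("speak", ["Let", "me", "see"]),
    Segment("think", ["12", "*", "3"]),
    Segment("speak", ["that", "is"]),
    Segment("think", ["=", "36"]),
]
balanced_reward = toy_balance_reward(example)
```

A linear penalty is used here purely for simplicity; the actual reward also accounts for timing, which this sketch does not model.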