AI Summary
Existing open-source end-to-end spoken dialogue models still exhibit limitations in both intelligence and expressiveness, and direct application of preference optimization often leads to unstable training due to the coupling between sparse semantic supervision and dense speech generation. This work proposes a modality-aware adaptive post-training approach that constrains preference updates to the semantic channel while explicitly anchoring acoustic behavior. By dynamically adjusting the mixing ratio between semantic and acoustic optimization based on rollout statistics, the method effectively decouples their optimization pathways. To the best of our knowledge, this is the first approach to enable stable reinforcement learning-based post-training for spoken dialogue models, consistently improving both semantic quality and vocal expressiveness across multiple benchmarks and mainstream architectures.
Abstract
End-to-end spoken dialogue models have garnered significant attention because they offer a higher potential ceiling in expressiveness and perceptual ability than cascaded systems. However, the intelligence and expressiveness of current open-source spoken dialogue models often remain below expectations. Motivated by the success of online reinforcement learning (RL) in other domains, one might attempt to directly apply preference optimization to spoken dialogue models, yet this transfer is non-trivial. We analyze these obstacles from the perspectives of reward modeling and rollout sampling, focusing on how sparse preference supervision interacts with dense speech generation under shared-parameter updates. Based on the analysis, we propose a modality-aware adaptive post-training recipe that makes RL practical for spoken dialogue: it constrains preference updates to the semantic channel and improves acoustic behavior via explicit anchoring, while dynamically regulating their mixture from rollout statistics to avoid unreliable preference gradients. We evaluate the method across multiple spoken dialogue benchmarks and representative architectures, and observe consistent improvements in semantic quality and speech expressiveness.
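The abstract gives no formulas, so the following is only an illustrative sketch of the general idea, not the authors' implementation. It assumes a per-token preference loss and a per-token anchoring loss are already available, applies the former to semantic positions and the latter to acoustic positions, and sets their mixture from the spread of rollout rewards. The function names, the linear weighting rule, and the thresholds `std_floor` and `std_ceiling` are all hypothetical.

```python
from statistics import pstdev


def preference_weight(rewards, std_floor=0.05, std_ceiling=0.5):
    """Map the reward spread across rollouts to a weight in [0, 1].

    Assumption: rollouts with near-identical rewards carry little usable
    preference signal, so the preference term is switched off for them.
    """
    if len(rewards) < 2:
        return 0.0
    spread = pstdev(rewards)
    scaled = (spread - std_floor) / (std_ceiling - std_floor)
    return min(1.0, max(0.0, scaled))


def masked_mean(per_token_loss, mask):
    """Average a per-token loss over the positions where mask is True."""
    kept = [loss for loss, keep in zip(per_token_loss, mask) if keep]
    return sum(kept) / len(kept) if kept else 0.0


def mixed_objective(pref_token_loss, anchor_token_loss, is_semantic, rewards):
    """Combine a semantic-only preference term with an acoustic anchor term."""
    lam = preference_weight(rewards)
    # Preference updates touch semantic positions only (hypothetical masking).
    pref = masked_mean(pref_token_loss, is_semantic)
    # The anchoring loss is applied to the remaining, acoustic positions.
    is_acoustic = [not s for s in is_semantic]
    anchor = masked_mean(anchor_token_loss, is_acoustic)
    return lam * pref + (1.0 - lam) * anchor, lam
```

In this sketch, a rollout group whose rewards are all equal yields a weight of zero, so only the anchoring term contributes; a group with widely spread rewards shifts the objective toward the semantic preference term.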