AI Summary
This work addresses the semantic degradation, repetition, and generation collapse commonly observed in full-duplex spoken language models when interaction timing is optimized via reinforcement learning over raw tokens, issues that arise from the tight coupling of timing control and semantic generation. To resolve this, the authors propose an action space projection mechanism that, for the first time, explicitly decouples turn-taking control from content generation. The approach combines a binarized action space, a rule-based reward function, and Group Relative Policy Optimization (GRPO). Empirical results show that this method preserves semantic coherence while cutting the repetitive n-gram ratio by over 50%, significantly improving interactive performance in turn-taking naturalness, response latency, and filler-word handling.
Abstract
End-to-end full-duplex Speech Language Models (SLMs) require precise turn-taking for natural interaction. However, optimizing temporal dynamics via standard raw-token reinforcement learning (RL) degrades semantic quality, causing severe generative collapse and repetition. We propose ASPIRin, an interactivity-optimized RL framework that explicitly decouples when to speak from what to say. Using Action Space Projection, ASPIRin maps the text vocabulary into a coarse-grained binary state (active speech vs. inactive silence). By applying Group Relative Policy Optimization (GRPO) with rule-based rewards, it balances user interruption against response latency. Empirical evaluations show that ASPIRin improves interactivity across turn-taking, backchanneling, and pause handling. Crucially, isolating timing from token selection preserves semantic coherence and reduces the proportion of duplicate n-grams by more than 50% compared to standard GRPO, effectively eliminating degenerative repetition.
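The two ingredients named above, projecting the token distribution onto a binary speak/silence action and scoring rollouts with a rule-based reward, can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the function names, the choice of which token ids count as "silence", and the reward weights and latency budget are all assumptions made for the example.

```python
import math

def project_to_binary_action(token_logprobs, silence_token_ids):
    """Collapse per-token log-probabilities into a binary action distribution.

    Hypothetical sketch of Action Space Projection: probability mass on
    designated silence/pad tokens is pooled into an 'inactive' action,
    everything else into 'active speech'.
    """
    m = max(token_logprobs)
    probs = [math.exp(lp - m) for lp in token_logprobs]  # stable softmax
    z = sum(probs)
    probs = [p / z for p in probs]
    p_silence = sum(probs[i] for i in silence_token_ids)
    return {"inactive": p_silence, "active": 1.0 - p_silence}

def rule_based_reward(responded, latency_s, interrupted_user,
                      latency_budget_s=1.0):
    """Illustrative rule-based reward trading off latency against interruption.

    The weights and the 1-second budget are placeholders, not values from
    the paper.
    """
    reward = 0.0
    if responded:
        reward += 1.0
        # Penalize replies slower than the latency budget.
        reward -= max(0.0, latency_s - latency_budget_s)
    if interrupted_user:
        # Penalize barging in while the user is still speaking.
        reward -= 1.0
    return reward

# Toy usage: a 5-token vocabulary where ids {0, 1} mark silence.
logits = [2.0, 1.0, 0.5, 0.5, 0.0]
action = project_to_binary_action(logits, silence_token_ids={0, 1})
```

In a GRPO setup, rewards like `rule_based_reward` would be computed per rollout and normalized within each sampled group to form the advantage, so no learned value model is needed; the binary projection keeps the RL signal on timing while the underlying token distribution for content is left untouched.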