ASPIRin: Action Space Projection for Interactivity-Optimized Reinforcement Learning in Full-Duplex Speech Language Models

📅 2026-04-11
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the semantic degradation, repetition, and generation collapse commonly observed when the interaction timing of full-duplex spoken language models is optimized via reinforcement learning over raw tokens, issues that arise from the tight coupling of timing control and semantic generation. To resolve this, the authors propose an action space projection mechanism that, for the first time, explicitly decouples turn-taking control from content generation. The approach combines a binarized action space, a rule-based reward function, and Group Relative Policy Optimization (GRPO). Empirical results show that the method preserves semantic coherence while reducing the repetitive n-gram ratio by over 50%, and significantly improves interactive performance in turn-taking naturalness, response latency, and filler-word handling.

πŸ“ Abstract
End-to-end full-duplex Speech Language Models (SLMs) require precise turn-taking for natural interaction. However, optimizing temporal dynamics via standard raw-token reinforcement learning (RL) degrades semantic quality, causing severe generative collapse and repetition. We propose ASPIRin, an interactivity-optimized RL framework that explicitly decouples when to speak from what to say. Using Action Space Projection, ASPIRin maps the text vocabulary into a coarse-grained binary state (active speech vs. inactive silence). By applying Group Relative Policy Optimization (GRPO) with rule-based rewards, it balances user interruption and response latency. Empirical evaluations show ASPIRin optimizes interactivity across turn-taking, backchanneling, and pause handling. Crucially, isolating timing from token selection preserves semantic coherence and reduces the proportion of duplicate n-grams by over 50% compared to standard GRPO, effectively eliminating degenerative repetition.
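The three ingredients the abstract names (projecting the token vocabulary onto a binary speak/silence action space, scoring rollouts with rule-based rewards on interruption and latency, and computing group-relative advantages as in GRPO) can be sketched as below. This is a minimal illustration, not the paper's implementation: the silence token ids, reward magnitudes, and all function names are assumptions.

```python
import numpy as np

# Hypothetical ids for silence/pad tokens; the real model's vocabulary differs.
SILENCE_TOKEN_IDS = {0, 1}

def project_actions(token_ids):
    """Action Space Projection (illustrative): map each raw text token
    to a binary action, 1 = active speech, 0 = inactive silence."""
    return [0 if t in SILENCE_TOKEN_IDS else 1 for t in token_ids]

def rule_based_reward(actions, user_interrupt_step=None, latency_penalty=0.01):
    """Toy rule-based reward balancing the two factors the abstract names:
    penalize speaking past a user interruption, and penalize slow first
    response (latency). Magnitudes are made up for illustration."""
    reward = 0.0
    if user_interrupt_step is not None:
        # Every active-speech step after the interruption costs reward.
        reward -= sum(actions[user_interrupt_step:])
    # Latency = number of silence steps before the first speech step.
    first_speech = actions.index(1) if 1 in actions else len(actions)
    reward -= latency_penalty * first_speech
    return reward

def grpo_advantages(group_rewards):
    """GRPO's group-relative advantage: normalize each rollout's reward
    against the mean and std of its sampled group."""
    r = np.asarray(group_rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)
```

Because the policy gradient then flows through the binary actions rather than through raw token choices, the reward only shapes *when* the model speaks, which is the decoupling the paper credits for avoiding degenerative repetition.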
Problem

Research questions and friction points this paper is trying to address.

full-duplex Speech Language Models
turn-taking
reinforcement learning
generative collapse
semantic coherence
Innovation

Methods, ideas, or system contributions that make the work stand out.

Action Space Projection
Full-Duplex Speech Language Models
Interactivity-Optimized RL
Turn-Taking
Generative Coherence