CosyEdit2: Speech-Editing-Oriented Reinforcement Learning Unlocks Better Zero-Shot TTS

📅 2026-05-25
🤖 AI Summary
Existing speech editing methods are constrained by the quality of paired data and by coarse-grained supervision, and they often fail to preserve local acoustic consistency. This work proposes a two-stage post-training framework: it first initializes an editing model via supervised fine-tuning, then applies Group Relative Policy Optimization (GRPO) on unpaired, target-free speech data, marking the first application of reinforcement learning to speech editing. The reinforcement learning stage removes the reliance on paired data and jointly improves speech editing and zero-shot text-to-speech (TTS) performance, pointing to a deep synergy between the two tasks. Experimental results show that the proposed method significantly outperforms current state-of-the-art approaches on both objective metrics and subjective listening tests, improving editing naturalness and zero-shot voice synthesis at the same time.
📝 Abstract
Speech editing and zero-shot Text-to-Speech (TTS) share a similar generative foundation conditioned on speech prompts, yet speech editing demands far stricter local acoustic consistency with surrounding unedited content. While prior work has shown that Supervised Fine-Tuning (SFT) enables TTS models to acquire functional editing capability, this approach remains fundamentally bottlenecked by imperfect paired editing data and coarse-grained optimization signals. To address these limitations, we propose CosyEdit2, a speech editing model built on a two-stage post-training framework that progresses from supervised editing initialization to editing-oriented Group Relative Policy Optimization (GRPO) over target-speech-free data. Extensive experiments demonstrate that CosyEdit2 not only substantially advances speech editing performance, but also unlocks better zero-shot TTS capability, revealing a deeper mutual relationship between the two tasks. Audio samples are available at https://cjy1018.github.io/CosyEdit2.
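As a rough illustration of the group-relative idea behind GRPO, the sketch below normalizes the rewards of several candidate outputs sampled for the same input against that group's own mean and standard deviation. It is a minimal sketch under stated assumptions, not CosyEdit2's implementation: the function name, the example reward values, and the choice of population standard deviation are all hypothetical, and the abstract does not say how the editing-oriented reward is computed.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Return GRPO-style advantages for one group of sampled outputs.

    Each reward is centered on the group mean and scaled by the group
    standard deviation, so no separate value model is needed.
    Assumption: population standard deviation; implementations differ.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    # eps keeps the division defined when every reward in the group is equal
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical scalar rewards for four candidate edits of the same utterance
rewards = [0.2, 0.5, 0.9, 0.4]
advantages = group_relative_advantages(rewards)
```

Candidates that score above their group's average receive a positive advantage and are reinforced, while below-average candidates are discouraged. Because the signal is relative within a group, it only needs a scalar reward per sample, which is consistent with training on data that has no target speech.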
Problem

Research questions and friction points this paper is trying to address.

speech editing
zero-shot TTS
acoustic consistency
paired editing data
optimization signals
Innovation

Methods, ideas, or system contributions that make the work stand out.

speech editing
zero-shot TTS
reinforcement learning
Group Relative Policy Optimization
two-stage post-training