Edit Content, Preserve Acoustics: Imperceptible Text-Based Speech Editing via Self-Consistency Rewards

📅 2026-01-31
🤖 AI Summary
This work addresses the limitations of existing speech editing methods that operate in the acoustic space, where content and style are often entangled, leading to generation instability and boundary artifacts that hinder perceptually seamless text-driven editing. To overcome these challenges, the authors propose an "edit content, preserve acoustics" framework that decouples editing into the semantic space and employs a Flow Matching decoder to reconstruct acoustic features. Additionally, a self-consistency reward mechanism is introduced, leveraging a pre-trained text-to-speech (TTS) model as an implicit evaluator to enforce context-aware alignment. Experimental results demonstrate that the proposed approach significantly outperforms current autoregressive and non-autoregressive baselines in intelligibility, robustness, and perceptual quality, achieving high-fidelity, seamlessly edited speech.
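
To make the Flow Matching decoder mentioned above more concrete, here is a minimal sketch of the generic straight-path flow matching objective and a fixed-step Euler sampler. It is an illustration under assumptions, not the authors' decoder: the summary does not state which flow matching formulation is used, and the conditioning on semantic tokens and acoustic context is omitted. All function names (`interpolate`, `target_velocity`, `flow_matching_loss`, `euler_sample`) and the `predict_velocity` callable are hypothetical.

```python
def interpolate(x0, x1, t):
    """Point on the straight path from noise x0 (t=0) to data x1 (t=1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0, x1):
    """Velocity of the straight path; it is constant in t."""
    return [b - a for a, b in zip(x0, x1)]


def flow_matching_loss(predict_velocity, x0, x1, t):
    """Mean squared error between predicted and target velocity at x_t.

    predict_velocity(x, t) stands in for the learned network (assumption).
    """
    xt = interpolate(x0, x1, t)
    v_pred = predict_velocity(xt, t)
    v_true = target_velocity(x0, x1)
    return sum((p - q) ** 2 for p, q in zip(v_pred, v_true)) / len(v_true)


def euler_sample(predict_velocity, x0, steps=10):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with fixed-step Euler."""
    x = list(x0)
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt
        v = predict_velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x
```

In a real decoder, `x1` would be an acoustic feature frame (for example a mel-spectrogram slice) and `predict_velocity` a neural network conditioned on the edited semantic tokens; here both are left abstract.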

๐Ÿ“ Abstract
Imperceptible text-based speech editing allows users to modify spoken content by altering the transcript. It demands that modified segments fuse seamlessly with the surrounding context. Prevalent methods operating in the acoustic space suffer from inherent content-style entanglement, leading to generation instability and boundary artifacts. In this paper, we propose a novel framework grounded in the principle of"Edit Content, Preserve Acoustics". Our approach relies on two core components: (1) Structural Foundations, which decouples editing into a stable semantic space while delegating acoustic reconstruction to a Flow Matching decoder; and (2) Perceptual Alignment, which employs a novel Self-Consistency Rewards Group Relative Policy Optimization. By leveraging a pre-trained Text-to-Speech model as an implicit critic -- complemented by strict intelligibility and duration constraints -- we effectively align the edited semantic token sequence with the original context. Empirical evaluations demonstrate that our method significantly outperforms state-of-the-art autoregressive and non-autoregressive baselines, achieving superior intelligibility, robustness, and perceptual quality.
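
The Perceptual Alignment component can be illustrated with a small sketch of the group-relative advantage used in Group Relative Policy Optimization: rewards for a group of candidates sampled from the same prompt are normalized by the group mean and standard deviation. The reward function below is purely hypothetical. The abstract only says that a TTS critic is combined with intelligibility and duration constraints, so the gating rule, the thresholds `max_wer` and `duration_tolerance`, and the assumption that the critic score lies in [0, 1] are illustrative choices, not the paper's definition.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one sampled group (GRPO-style advantage)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group
    return [(r - mu) / (sigma + eps) for r in rewards]


def gated_reward(critic_score, wer, duration_ratio,
                 max_wer=0.1, duration_tolerance=0.2):
    """Hypothetical reward: critic score gated by two hard constraints.

    critic_score: assumed to be in [0, 1], higher is better (assumption).
    wer: word error rate of the edited segment against the target text.
    duration_ratio: generated duration divided by expected duration.
    """
    if wer > max_wer:
        return 0.0  # fails the intelligibility constraint
    if abs(duration_ratio - 1.0) > duration_tolerance:
        return 0.0  # fails the duration constraint
    return critic_score
```

For example, scoring three candidate edits with `gated_reward` and passing the resulting list to `group_relative_advantages` yields positive advantages for candidates above the group mean and negative ones below it, which is the signal a policy-gradient update would then weight.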

Problem

Research questions and friction points this paper is trying to address.

text-based speech editing
imperceptible editing
content-style entanglement
boundary artifacts
acoustic preservation

Innovation

Methods, ideas, or system contributions that make the work stand out.

speech editing
content-acoustic disentanglement
self-consistency rewards
flow matching
text-to-speech