AST: Adaptive, Seamless, and Training-Free Precise Speech Editing

📅 2026-04-17
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work addresses limitations of existing speech editing methods, which rely on task-specific training, incur high data costs, and struggle to preserve temporal consistency and speaker identity in unedited regions. The authors propose AST, a training-free editing framework built on a pretrained autoregressive text-to-speech (TTS) model; its Latent Recomposition step stitches preserved source segments together with newly synthesized target segments. To keep transitions at edit boundaries natural without disrupting the generative manifold, they introduce Adaptive Weak Fact Guidance (AWFG). They also construct a new dataset, LibriSpeech-Edit, and propose a Word-level Dynamic Time Warping (WDTW) metric for evaluating temporal consistency in unedited regions. In experiments, compared with the previously most temporally consistent baseline, AST improves temporal consistency while reducing word error rate by nearly 70%. Applied to a foundation TTS model, it reduces WDTW by 27%, achieving state-of-the-art speaker preservation and temporal fidelity.
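The splicing idea in the summary can be illustrated with a toy sketch. This is not the authors' implementation: the function name `recompose_latents`, the frame-index interface, and the representation of latents as plain lists of frames are all assumptions made for illustration, and the sketch omits the boundary guidance (AWFG) entirely.

```python
def recompose_latents(source, target, edit_start, edit_end):
    """Replace source[edit_start:edit_end] with the frames in `target`.

    Hypothetical helper: `source` and `target` are lists of latent frames
    (each frame a list of floats). Frames outside the edit span are reused
    unchanged, which is what keeps the unedited regions intact.
    """
    if not (0 <= edit_start <= edit_end <= len(source)):
        raise ValueError("edit span must lie inside the source sequence")
    # Preserved prefix + newly synthesized span + preserved suffix.
    return source[:edit_start] + target + source[edit_end:]


# Toy data: 10 source frames of zeros, 5 newly synthesized frames of ones.
source = [[0.0] * 4 for _ in range(10)]
target = [[1.0] * 4 for _ in range(5)]

# Replace frames 3..5 (three frames) with the five target frames.
edited = recompose_latents(source, target, edit_start=3, edit_end=6)
```

Note that the edited sequence may differ in length from the source, since the replacement span need not have the same number of frames as the span it replaces.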

📝 Abstract
Text-based speech editing aims to modify specific segments while preserving speaker identity and acoustic context. Existing methods rely on task-specific training, which incurs high data costs and struggles with temporal fidelity in unedited regions. Meanwhile, adapting Text-to-Speech (TTS) models often faces a trade-off between editing quality and consistency. To address these issues, we propose AST, an Adaptive, Seamless, and Training-free precise speech editing framework. Leveraging a pre-trained autoregressive TTS model, AST introduces Latent Recomposition to selectively stitch preserved source segments with newly synthesized targets. Furthermore, AST extends this latent manipulation to enable precise style editing for specific speech segments. To prevent artifacts at these edit boundaries, the framework incorporates Adaptive Weak Fact Guidance (AWFG). AWFG dynamically modulates a mel-space guidance signal, enforcing structural constraints only where necessary without disrupting the generative manifold. To fill the gap of publicly accessible benchmarks, we introduce LibriSpeech-Edit, a new and larger speech editing dataset. As existing metrics poorly evaluate temporal consistency in unedited regions, we propose Word-level Dynamic Time Warping (WDTW). Extensive experiments demonstrate that AST resolves the controllability-quality trade-off without extra training. Compared to the previous most temporally consistent baseline, AST improves consistency while reducing Word Error Rate by nearly 70%. Moreover, applying AST to a foundation TTS model reduces WDTW by 27%, achieving state-of-the-art speaker preservation and temporal fidelity.
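The abstract does not define WDTW, so the following is only a sketch of classic dynamic time warping, the algorithm the metric's name refers to. Applying it to per-word durations is an assumption made for illustration; the function name `dtw_distance` and the sample values are hypothetical, and the paper's actual features and normalization may differ.

```python
import math


def dtw_distance(a, b):
    """Classic DTW cost between two 1-D sequences using absolute difference."""
    n, m = len(a), len(b)
    # cost[i][j] = minimal accumulated cost of aligning a[:i] with b[:j].
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of: insertion, deletion, or match.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


# Hypothetical per-word durations (seconds) in an unedited region,
# measured in the source audio and again in the edited output.
source_durations = [0.30, 0.25, 0.40]
edited_durations = [0.30, 0.25, 0.40]

unchanged_cost = dtw_distance(source_durations, edited_durations)
```

Under this assumption, a lower cost means the timing of the unedited words drifted less between source and edited audio; identical timings give a cost of zero.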
Problem

Research questions and friction points this paper is trying to address.

speech editing
temporal fidelity
speaker identity
text-to-speech
training-free
Innovation

Methods, ideas, or system contributions that make the work stand out.

Training-Free Speech Editing
Latent Recomposition
Adaptive Weak Fact Guidance
Temporal Fidelity
Text-to-Speech Adaptation
Sihan Lv
School of Software Technology, Zhejiang University, Hangzhou, China
Yechen Jin
School of Software Technology, Zhejiang University, Hangzhou, China
Zhen Li
School of Software Technology, Zhejiang University, Hangzhou, China; Institute of Remote Sensing Satellite, China Academy of Space Technology, Beijing, China
Jintao Chen
School of Software Technology, Zhejiang University, Hangzhou, China; Innovation and Management Center of the School of Software (Ningbo), Zhejiang University, Ningbo, China
Jinshan Zhang
School of Software Technology, Zhejiang University, Hangzhou, China; Innovation and Management Center of the School of Software (Ningbo), Zhejiang University, Ningbo, China
Ying Li
Zhejiang University
Service Computing, Business Process Management
Jianwei Yin
Professor of Computer Science and Technology, Zhejiang University
Service Computing, Computer Architecture, Distributed Computing, AI
Meng Xi
College of Computer Science and Technology, Zhejiang University
service computing, service pattern, data mining, artificial intelligence