RephraseTTS: Dynamic Length Text based Speech Insertion with Speaker Style Transfer

📅 2025-08-23
🤖 AI Summary
This paper addresses conditional speech insertion in speech editing: given a full text transcript, dynamically inserting variable-length speech segments into a source utterance while preserving the original speaker’s timbre, prosody, and spectral characteristics. To this end, the authors propose a Transformer-based non-autoregressive end-to-end model that jointly encodes textual semantics, local speech rhythm, and acoustic features. The model determines the duration of the inserted segment during inference from both the lexical content and contextual prosodic cues, without requiring forced alignment or post-processing. Evaluated on LibriTTS, the method outperforms baselines built on an existing adaptive TTS method in both the experiments and a user study, and the user study indicates that the edited speech is natural and stylistically consistent. The approach targets controllable, high-fidelity speech editing.

📝 Abstract
We propose a method for the task of text-conditioned speech insertion, i.e., inserting a speech segment into an input speech sample, conditioned on the corresponding complete text transcript. An example use case is updating the speech audio when corrections are made to the corresponding text transcript. The proposed method follows a transformer-based non-autoregressive approach that allows speech insertions of variable length; the length is determined dynamically during inference from the text transcript and the tempo of the available partial input. The method maintains the speaker's voice characteristics, prosody, and other spectral properties of the available speech input. Results from our experiments and user study on LibriTTS show that our method outperforms baselines based on an existing adaptive text-to-speech method. We also provide numerous qualitative results that illustrate the quality of the output of the proposed method.
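
As a rough illustration of the idea that the insertion length depends on both the text and the tempo of the available speech, the sketch below estimates a frame count for new text from the speaking rate of the surrounding context. This is a hand-written heuristic and not the paper's learned model: the function names `count_units`, `estimate_insertion_frames`, and `splice_frames`, and the use of alphanumeric characters as the unit of text length, are all assumptions made for this example.

```python
def count_units(text):
    """Count alphanumeric characters as a crude proxy for spoken length (assumption)."""
    return sum(1 for ch in text if ch.isalnum())


def estimate_insertion_frames(context_text, context_frames, inserted_text, min_frames=1):
    """Estimate how many acoustic frames the inserted text should occupy.

    The tempo is measured on the available speech as frames per text unit,
    then applied to the text to be inserted. A learned model would predict
    this instead; the linear rule here is only a stand-in.
    """
    context_units = count_units(context_text)
    if context_units <= 0 or context_frames <= 0:
        raise ValueError("context must contain text and at least one frame")
    tempo = context_frames / context_units  # frames per text unit
    return max(min_frames, round(count_units(inserted_text) * tempo))


def splice_frames(left, right, n_new, fill=None):
    """Return left + n_new placeholder frames + right.

    The placeholders mark the region a synthesis model would fill in.
    """
    return list(left) + [fill] * n_new + list(right)


# Context: 16 alphanumeric characters spoken over 160 frames (10 frames each).
n_new = estimate_insertion_frames("the quick brown fox", 160, "lazy dog")
edited = splice_frames([0.1, 0.2], [0.3], n_new)
```

With these made-up numbers, the seven letters of "lazy dog" map to 70 new frames, so a faster or slower speaker in the context would change the length of the inserted region accordingly.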

Problem

Research questions and friction points this paper is trying to address.

Inserting variable-length speech segments into existing audio
Maintaining speaker voice characteristics during speech insertion
Enabling dynamic text-conditioned speech modifications and corrections

Innovation

Methods, ideas, or system contributions that make the work stand out.

Transformer-based non-autoregressive dynamic length insertion
Speaker style transfer maintaining voice characteristics
Text-conditioned speech insertion with variable tempo adaptation

Authors

Neeraj Matiyali
Indian Institute of Technology, Kanpur
Siddharth Srivastava
Arizona State University
Artificial Intelligence, Automated Planning, Robotics, Task and Motion Planning, AI Assessment
Gaurav Sharma
Indian Institute of Technology, Kanpur