Multi-interaction TTS toward professional recording reproduction

📅 2025-07-01
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Current text-to-speech (TTS) systems lack the capability to perform fine-grained, multi-turn style refinement—akin to human-directed audio post-production—making it difficult to satisfy users’ precise stylistic control requirements. To address this, we propose the first TTS framework supporting multi-step interactive style optimization, inspired by the collaborative dynamics between voice directors and actors. Our method introduces a feedback-driven interactive modeling paradigm: we construct a curated instruction-speech paired dataset and design an end-to-end differentiable style correction module that dynamically refines speech outputs based on natural language feedback. Experiments demonstrate that the model consistently improves timbre, prosody, and emotional expressiveness in response to user instructions, significantly enhancing alignment with target styles (average improvement of 28.6%). Moreover, the framework enables real-time listening and online iterative validation, facilitating practical, user-in-the-loop TTS customization.
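The paper's code is not reproduced here, but the multi-turn, feedback-driven loop the summary describes can be pictured with a short sketch. This is a minimal illustration only: the class and method names (`InteractiveTTS`, `correct_style`, `generate`) are hypothetical stand-ins, not the authors' actual API.

```python
# Minimal sketch of the multi-turn, feedback-driven refinement loop
# described above. All names here (InteractiveTTS, correct_style,
# generate) are hypothetical stand-ins, not the authors' API.

class InteractiveTTS:
    """Wraps a TTS backbone so a user can refine style across turns."""

    def __init__(self, model):
        self.model = model    # pretrained TTS backbone (assumed interface)
        self.history = []     # list of (feedback, style) pairs, one per turn

    def synthesize(self, text):
        """Turn 0: initial synthesis with the model's default style."""
        style = self.model.default_style()
        self.history = [(None, style)]
        return self.model.generate(text, style)

    def refine(self, text, feedback):
        """One interaction turn: correct the current style from
        natural-language feedback, then re-synthesize."""
        prev_style = self.history[-1][1]
        style = self.model.correct_style(prev_style, feedback)
        self.history.append((feedback, style))
        return self.model.generate(text, style)

# Example session, emulating a voice director's successive directions:
#   tts = InteractiveTTS(model)
#   audio = tts.synthesize("Welcome aboard.")
#   audio = tts.refine("Welcome aboard.", "warmer and less formal")
#   audio = tts.refine("Welcome aboard.", "slow down the last word")
```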

📝 Abstract
Voice directors often iteratively refine voice actors' performances by providing feedback to achieve the desired outcome. While this iterative feedback-based refinement process is important in actual recordings, it has been overlooked in text-to-speech synthesis (TTS). As a result, fine-grained style refinement after the initial synthesis is not possible, even though the synthesized speech often deviates from the user's intended style. To address this issue, we propose a TTS method with multi-step interaction that allows users to intuitively and rapidly refine synthesized speech. Our approach models the interaction between the TTS model and its user to emulate the relationship between voice actors and voice directors. Experiments show that the proposed model, together with its corresponding dataset, enables iterative style refinements in accordance with users' directions, thus demonstrating its multi-interaction capability. Sample audio is available at: https://ntt-hilab-gensp.github.io/ssw13multiinteraction_tts/
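The abstract refers to a corresponding dataset built for this interaction paradigm. As a rough illustration of what one instruction-speech paired example might look like, here is a sketch; the field names are assumptions for illustration, not the released dataset's actual schema.

```python
# Illustrative shape of one instruction-speech paired example; the
# field names are assumptions, not the dataset's actual schema.
example = {
    "text": "Welcome aboard.",
    "turns": [
        {"feedback": None,                      "audio": "turn0.wav"},  # initial take
        {"feedback": "warmer and less formal",  "audio": "turn1.wav"},
        {"feedback": "slow down the last word", "audio": "turn2.wav"},
    ],
}
```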
Problem

Research questions and friction points this paper is trying to address.

Synthesized speech often deviates from the user's intended style, yet cannot be iteratively refined
The actor-director feedback loop of real recording sessions is not modeled in TTS
Fine-grained style control after initial synthesis is missing
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-step interaction for TTS refinement
Modeling actor-director feedback relationship
Iterative style refinement via user direction