AgentSteerTTS: A Multi-Agent Closed-Loop Framework for Composite-Instruction Text-to-Speech

📅 2026-05-14

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the challenge of achieving fine-grained and intention-faithful expressive control in text-to-speech (TTS) systems under complex instructions, which stems from a structural mismatch between discrete textual intents and continuous acoustic manifestations. To bridge this gap, we propose a cognition-inspired multi-agent collaborative framework that leverages adversarial disentanglement to separate speaker identity from emotional prosody, employs a dual-stream acoustic prototype bank to anchor abstract intents, and integrates a fast-slow perceptual feedback mechanism to enhance semantic-acoustic consistency and output expressiveness. Our approach uniquely unifies adversarial disentanglement, prototype retrieval, gated attention fusion, and latent-space gradient correction within a single TTS architecture. Experiments on composite instruction benchmarks and public test sets demonstrate significant improvements over state-of-the-art models in both consistency and naturalness of expressive control.
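To make the prototype-anchoring idea concrete, the following is a minimal sketch of how an abstract intent embedding might be matched against a bank of acoustic prototypes. It is an illustration only, not the paper's implementation: the cosine-similarity scoring, the top-k selection, the function name `retrieve_anchors`, and the toy three-dimensional vectors are all assumptions, and the summary does not specify how the two streams of the bank are organized.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_anchors(query, bank, k=2):
    """Return the names of the k prototypes most similar to the query embedding.

    `bank` maps a hypothetical prototype label to its embedding vector.
    """
    scored = [(cosine(query, vec), name) for name, vec in bank.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [name for _, name in scored[:k]]


# Toy prototype bank (labels and values are made up for illustration).
bank = {
    "calm": [1.0, 0.0, 0.0],
    "excited": [0.0, 1.0, 0.0],
    "sad": [0.0, 0.0, 1.0],
    "cheerful": [0.5, 0.8, 0.0],
}

# Hypothetical embedding of an instruction such as "speak with high energy".
query = [0.1, 1.0, 0.0]
top_anchors = retrieve_anchors(query, bank, k=2)
```

In a real system the bank would hold learned embeddings of recorded speech segments, and an approximate nearest-neighbor index would likely replace the linear scan.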

📝 Abstract

While existing text-to-speech (TTS) models exhibit high expressiveness, fine-grained control over composite instructions remains challenging due to the structural mismatch between discrete textual intents and continuous acoustic realizations. Inspired by human cognitive decoupling, we introduce AgentSteerTTS, a multi-agent closed-loop framework designed for intent-faithful expressive control of composite instructions. First, an adversarial disentanglement agent mitigates speaker-emotion leakage by learning separable identity and emotion-prosody subspaces with leakage-suppressing regularization. Next, a Dual-Stream Anchoring Controller grounds abstract intents using a large-scale acoustic prototype library: a Retrieval Agent selects expressive anchors, while a Synthesis Agent fuses them into continuous control vectors via gated attention. Finally, a Fast-Slow Feedback Agent refines output intensity through latent gradient correction and resolves semantic-acoustic mismatches using high-level perceptual critique. Experiments on a composite-instruction benchmark and public test sets show that AgentSteerTTS yields consistent and significant improvements over the baselines, demonstrating the effectiveness of the proposed method.
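The fusion step attributed to the Synthesis Agent can be sketched as follows. This is a hedged illustration of generic gated attention, not the authors' architecture: the scaled dot-product scoring, the single scalar sigmoid gate, the function name `gated_attention_fuse`, and the parameters `gate_w` and `gate_b` are assumptions introduced here.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def gated_attention_fuse(query, anchors, gate_w, gate_b=0.0):
    """Fuse retrieved anchor vectors into one control vector.

    1. Attention: weight each anchor by scaled dot-product similarity to the query.
    2. Gate: a scalar sigmoid gate (hypothetical parameters gate_w, gate_b)
       decides how much of the attended context replaces the query itself.
    """
    dim = len(query)
    scores = [
        sum(q * a for q, a in zip(query, anchor)) / math.sqrt(dim)
        for anchor in anchors
    ]
    weights = softmax(scores)
    context = [
        sum(w * anchor[i] for w, anchor in zip(weights, anchors))
        for i in range(dim)
    ]
    gate_input = list(query) + context  # concatenation [query; context]
    z = sum(w * x for w, x in zip(gate_w, gate_input)) + gate_b
    gate = 1.0 / (1.0 + math.exp(-z))
    control = [gate * c + (1.0 - gate) * q for c, q in zip(context, query)]
    return control, weights, gate


# Toy example: two anchors, zero-initialized gate (so the gate equals 0.5).
query = [1.0, 0.0]
anchors = [[1.0, 0.0], [0.0, 1.0]]
control, weights, gate = gated_attention_fuse(query, anchors, gate_w=[0.0] * 4)
```

With a learned gate, the model could fall back on the instruction embedding when the retrieved anchors are a poor match and lean on the anchors when they are reliable; whether AgentSteerTTS uses a scalar or per-dimension gate is not stated in the abstract.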

Problem

Research questions and friction points this paper is trying to address.

text-to-speech

composite instructions

expressive control

structural mismatch

fine-grained control

Innovation

Methods, ideas, or system contributions that make the work stand out.

multi-agent framework

adversarial disentanglement

dual-stream anchoring