AgentSteerTTS: A Multi-Agent Closed-Loop Framework for Composite-Instruction Text-to-Speech

📅 2026-05-14
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work addresses the challenge of achieving fine-grained and intention-faithful expressive control in text-to-speech (TTS) systems under complex instructions, which stems from a structural mismatch between discrete textual intents and continuous acoustic manifestations. To bridge this gap, we propose a cognition-inspired multi-agent collaborative framework that leverages adversarial disentanglement to separate speaker identity from emotional prosody, employs a dual-stream acoustic prototype bank to anchor abstract intents, and integrates a fast-slow perceptual feedback mechanism to enhance semantic-acoustic consistency and output expressiveness. Our approach uniquely unifies adversarial disentanglement, prototype retrieval, gated attention fusion, and latent-space gradient correction within a single TTS architecture. Experiments on composite instruction benchmarks and public test sets demonstrate significant improvements over state-of-the-art models in both consistency and naturalness of expressive control.
📝 Abstract
While existing text-to-speech (TTS) models exhibit high expressiveness, fine-grained control over composite instructions remains challenging due to the structural mismatch between discrete textual intents and continuous acoustic realizations. Inspired by human cognitive decoupling, we introduce AgentSteerTTS, a multi-agent closed-loop framework designed for intent-faithful expressive control of composite instructions. First, an adversarial disentanglement agent mitigates speaker-emotion leakage by learning separable identity and emotion-prosody subspaces with leakage-suppressing regularization. Next, a Dual-Stream Anchoring Controller grounds abstract intents using a large-scale acoustic prototype library: a Retrieval Agent selects expressive anchors, while a Synthesis Agent fuses them into continuous control vectors via gated attention. Finally, a Fast-Slow Feedback Agent refines output intensity through latent gradient correction and resolves semantic-acoustic mismatches using high-level perceptual critique. Experiments on a composite-instruction benchmark and public test sets show that AgentSteerTTS yields consistent and significant improvements over the baselines, demonstrating the effectiveness of the proposed method.
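To make the retrieve-then-fuse step of the anchoring controller more concrete, the following is a minimal sketch and not the authors' implementation. It assumes intents and prototypes are plain vectors of equal dimension, that retrieval uses cosine similarity, and that fusion uses scaled dot-product attention followed by a single scalar sigmoid gate. The function names `retrieve_anchors` and `gated_attention_fuse`, the gate parameters `w_gate` and `b_gate`, and the random data are all hypothetical choices for illustration; the abstract does not specify any of them.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()


def retrieve_anchors(query, bank, k=3):
    """Return indices and cosine similarities of the k prototypes closest to the query.

    Assumption: cosine similarity is the retrieval metric (not stated in the abstract).
    """
    q = query / np.linalg.norm(query)
    b = bank / np.linalg.norm(bank, axis=1, keepdims=True)
    sims = b @ q
    top = np.argsort(-sims)[:k]
    return top, sims[top]


def gated_attention_fuse(query, anchors, w_gate, b_gate=0.0):
    """Fuse retrieved anchors into one control vector.

    Attention weights come from scaled dot products between the query and
    each anchor. A scalar sigmoid gate (hypothetical parameterisation) then
    blends the attended context with the original query.
    """
    d = query.shape[0]
    weights = softmax(anchors @ query / np.sqrt(d))
    context = weights @ anchors
    gate_input = np.concatenate([query, context])
    gate = 1.0 / (1.0 + np.exp(-(w_gate @ gate_input + b_gate)))
    control = gate * context + (1.0 - gate) * query
    return control, weights, gate


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    dim, n_prototypes = 8, 50
    bank = rng.normal(size=(n_prototypes, dim))   # stand-in prototype library
    intent = rng.normal(size=dim)                 # stand-in intent embedding
    w_gate = rng.normal(size=2 * dim) * 0.1       # untrained gate weights

    idx, sims = retrieve_anchors(intent, bank, k=3)
    control, weights, gate = gated_attention_fuse(intent, bank[idx], w_gate)
    print("selected prototypes:", idx)
    print("attention weights:", np.round(weights, 3))
    print("gate value:", round(float(gate), 3))
    print("control vector shape:", control.shape)
```

In this sketch the gate lets the controller fall back toward the raw intent embedding when the retrieved anchors are a poor match; a trained system would learn `w_gate` rather than sample it.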
Problem

Research questions and friction points this paper is trying to address.

text-to-speech
composite instructions
expressive control
structural mismatch
fine-grained control
Innovation

Methods, ideas, or system contributions that make the work stand out.

multi-agent framework
adversarial disentanglement
dual-stream anchoring
composite-instruction TTS
closed-loop feedback