🤖 AI Summary
This work addresses fine-grained, intent-faithful expressive control in text-to-speech (TTS) systems under composite instructions, a problem that stems from the structural mismatch between discrete textual intents and continuous acoustic realizations. To bridge this gap, we propose AgentSteerTTS, a cognition-inspired multi-agent closed-loop framework with three parts: adversarial disentanglement separates speaker identity from emotional prosody, a dual-stream controller anchors abstract intents to a large-scale acoustic prototype library, and a fast-slow perceptual feedback mechanism improves semantic-acoustic consistency and output expressiveness. The approach combines adversarial disentanglement, prototype retrieval, gated attention fusion, and latent-space gradient correction within a single TTS architecture. Experiments on a composite-instruction benchmark and public test sets show consistent, significant improvements over the baselines in both consistency and naturalness of expressive control.
📝 Abstract
While existing text-to-speech (TTS) models exhibit high expressiveness, fine-grained control over composite instructions remains challenging due to the structural mismatch between discrete textual intents and continuous acoustic realizations. Inspired by human cognitive decoupling, we introduce AgentSteerTTS, a multi-agent closed-loop framework designed for intent-faithful expressive control of composite instructions. First, an adversarial disentanglement agent mitigates speaker-emotion leakage by learning separable identity and emotion-prosody subspaces with leakage-suppressing regularization. Next, a Dual-Stream Anchoring Controller grounds abstract intents using a large-scale acoustic prototype library: a Retrieval Agent selects expressive anchors, while a Synthesis Agent fuses them into continuous control vectors via gated attention. Finally, a Fast-Slow Feedback Agent refines output intensity through latent gradient correction and resolves semantic-acoustic mismatches using high-level perceptual critique. Experiments on a composite-instruction benchmark and public test sets show that AgentSteerTTS yields consistent and significant improvements over the baselines, demonstrating the effectiveness of the proposed method.
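The retrieve-then-fuse step of the Dual-Stream Anchoring Controller can be illustrated with a minimal sketch. The abstract does not specify the similarity measure, the attention form, or the gate, so everything here is an assumption made for illustration: cosine-similarity top-k retrieval, scaled dot-product attention over the retrieved anchors, and a single scalar sigmoid gate that blends the attended context with the intent embedding. The function names (`retrieve_top_k`, `gated_attention_fuse`) and the parameters `gate_weights` and `gate_bias` are hypothetical and do not come from the paper.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    na, nb = math.sqrt(dot(a, a)), math.sqrt(dot(b, b))
    return dot(a, b) / (na * nb) if na > 0 and nb > 0 else 0.0


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def retrieve_top_k(intent, bank, k):
    """Assumed retrieval rule: the k prototypes most cosine-similar to the intent."""
    ranked = sorted(bank, key=lambda p: cosine(intent, p), reverse=True)
    return ranked[:k]


def gated_attention_fuse(intent, anchors, gate_weights, gate_bias=0.0):
    """Assumed fusion rule: attend over anchors, then gate against the intent.

    gate_weights has length 2 * len(intent) and scores the concatenation
    [intent; context]. In a trained model it would be learned.
    """
    scale = math.sqrt(len(intent))
    weights = softmax([dot(intent, a) / scale for a in anchors])
    # Attention-weighted average of the retrieved anchors.
    context = [
        sum(w * a[i] for w, a in zip(weights, anchors))
        for i in range(len(intent))
    ]
    # Scalar gate in (0, 1): 1 keeps only the context, 0 keeps only the intent.
    gate = sigmoid(dot(gate_weights, intent + context) + gate_bias)
    return [gate * c + (1.0 - gate) * q for c, q in zip(context, intent)]


# Toy usage with made-up 2-D vectors (not real acoustic prototypes).
intent = [1.0, 0.0]
bank = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
anchors = retrieve_top_k(intent, bank, k=2)
control = gated_attention_fuse(intent, anchors, gate_weights=[0.0] * 4)
```

With all-zero `gate_weights` and zero bias the gate is 0.5, so the control vector is the midpoint between the intent embedding and the attended context. A learned gate would shift this balance per instruction.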