🤖 AI Summary
This work proposes CAST-TTS, a novel text-to-speech (TTS) framework that unifies voice and textual prompts for speaker control within a single, streamlined architecture. Unlike conventional TTS systems that employ separate modules for processing speech and text cues, leading to fragmented modeling, CAST-TTS integrates both prompt types through a unified cross-attention mechanism. It leverages pre-trained encoders to extract features from speech and text inputs and aligns their representations in a shared embedding space via a multi-stage training strategy. This approach eliminates the need for complex, specialized components while maintaining architectural simplicity. Experimental results demonstrate that CAST-TTS achieves speech synthesis quality on par with dedicated single-input models, thereby validating the efficacy of unified speaker control in end-to-end TTS systems.
📝 Abstract
Current Text-to-Speech (TTS) systems typically use separate models for speech-prompted and text-prompted timbre control. While unifying both control signals into a single model is desirable, the challenge of cross-modal alignment often results in overly complex architectures and training objectives. To address this challenge, we propose CAST-TTS, a simple yet effective framework for unified timbre control. Features are extracted from speech prompts and text prompts using pre-trained encoders. A multi-stage training strategy efficiently aligns the speech and projected text representations within a shared embedding space. A single cross-attention mechanism then allows the model to use either of these representations to control the timbre. Extensive experiments validate that the unified cross-attention mechanism is critical for achieving high-quality synthesis. CAST-TTS achieves performance comparable to specialized single-input models while operating within a unified architecture. The demo page can be accessed at https://HiRookie9.github.io/CAST-TTS-Page.
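The key idea in the abstract, one cross-attention block that conditions on either a speech-prompt or a text-prompt embedding once both live in a shared space, can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: the shapes, projection matrices, and variable names (`decoder_states`, `speech_prompt`, `text_prompt`) are all hypothetical stand-ins for the pre-trained encoder outputs described above.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, prompt, Wq, Wk, Wv):
    """Single-head cross-attention: queries attend over prompt embeddings.

    queries: (T, d) TTS decoder states
    prompt:  (S, d) prompt embeddings (speech or projected text) in the shared space
    """
    Q, K, V = queries @ Wq, prompt @ Wk, prompt @ Wv
    scores = Q @ K.T / np.sqrt(Q.shape[-1])   # (T, S) scaled dot-product scores
    return softmax(scores, axis=-1) @ V       # (T, d) timbre-conditioned output

rng = np.random.default_rng(0)
d = 8
Wq, Wk, Wv = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))
decoder_states = rng.standard_normal((5, d))

# Hypothetical prompt embeddings: because both modalities are aligned into one
# shared embedding space, the SAME attention parameters serve either input.
speech_prompt = rng.standard_normal((12, d))  # e.g. frames from a speech encoder
text_prompt = rng.standard_normal((7, d))     # e.g. projected text-description tokens

out_speech = cross_attention(decoder_states, speech_prompt, Wq, Wk, Wv)
out_text = cross_attention(decoder_states, text_prompt, Wq, Wk, Wv)
print(out_speech.shape, out_text.shape)  # both (5, 8): one mechanism, two prompt types
```

The point of the sketch is that nothing in `cross_attention` depends on which modality produced `prompt`; the alignment stage does the modality-specific work, so the synthesis model needs no duplicated branches.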