CAST-TTS: A Simple Cross-Attention Framework for Unified Timbre Control in TTS

📅 2026-03-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work proposes CAST-TTS, a text-to-speech (TTS) framework that unifies speech and text prompts for timbre control within a single architecture. Unlike conventional TTS systems, which typically use separate models for speech-prompted and text-prompted control, CAST-TTS handles both prompt types through one cross-attention mechanism. Pre-trained encoders extract features from the speech and text prompts, and a multi-stage training strategy aligns the speech and projected text representations in a shared embedding space. This avoids the overly complex architectures and training objectives that cross-modal alignment often requires. Experiments show that CAST-TTS performs comparably to specialized single-input models while operating within a unified architecture.

📝 Abstract
Current Text-to-Speech (TTS) systems typically use separate models for speech-prompted and text-prompted timbre control. While unifying both control signals into a single model is desirable, the challenge of cross-modal alignment often results in overly complex architectures and training objectives. To address this challenge, we propose CAST-TTS, a simple yet effective framework for unified timbre control. Features are extracted from speech prompts and text prompts using pre-trained encoders. The multi-stage training strategy efficiently aligns the speech and projected text representations within a shared embedding space. A single cross-attention mechanism then allows the model to use either of these representations to control the timbre. Extensive experiments validate that the unified cross-attention mechanism is critical for achieving high-quality synthesis. CAST-TTS achieves performance comparable to specialized single-input models while operating within a unified architecture. The demo page can be accessed at https://HiRookie9.github.io/CAST-TTS-Page.
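The idea of one cross-attention layer serving both prompt types can be illustrated with a minimal sketch. This is not the authors' implementation: the dimensions, the linear text projection `text_proj`, and the single-head attention are assumptions chosen for illustration, and random matrices stand in for both the learned weights and the pre-trained encoder outputs. The multi-stage training that aligns the two representations is not shown.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def random_matrix(rows, cols, rng):
    """Random stand-in for learned weights or encoder features."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


def cross_attention(queries, prompt, w_q, w_k, w_v):
    """Single-head cross-attention: decoder queries attend over prompt features."""
    q = matmul(queries, w_q)
    k = matmul(prompt, w_k)
    v = matmul(prompt, w_v)
    scale = math.sqrt(len(q[0]))
    k_transposed = [list(col) for col in zip(*k)]
    scores = matmul(q, k_transposed)
    weights = [softmax([s / scale for s in row]) for row in scores]
    return matmul(weights, v), weights


rng = random.Random(0)
D_SHARED, D_TEXT = 8, 6  # hypothetical sizes, not taken from the paper

# One set of attention weights is shared by both prompt types.
w_q = random_matrix(D_SHARED, D_SHARED, rng)
w_k = random_matrix(D_SHARED, D_SHARED, rng)
w_v = random_matrix(D_SHARED, D_SHARED, rng)

# Hypothetical projection that maps text features into the shared space.
text_proj = random_matrix(D_TEXT, D_SHARED, rng)

decoder_states = random_matrix(5, D_SHARED, rng)  # 5 decoder frames
speech_prompt = random_matrix(7, D_SHARED, rng)   # stand-in for speech encoder output
text_prompt = random_matrix(3, D_TEXT, rng)       # stand-in for text encoder output

# The same layer consumes either representation.
out_speech, attn_speech = cross_attention(decoder_states, speech_prompt, w_q, w_k, w_v)
out_text, attn_text = cross_attention(
    decoder_states, matmul(text_prompt, text_proj), w_q, w_k, w_v
)
```

Both calls return one output vector per decoder frame in the shared dimension, regardless of prompt length or modality, which is what lets the downstream synthesis stack stay identical for speech and text prompts in this sketch.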
Problem

Research questions and friction points this paper is trying to address.

Text-to-Speech
timbre control
cross-modal alignment
unified framework
speech synthesis
Innovation

Methods, ideas, or system contributions that make the work stand out.

cross-attention
unified timbre control
text-to-speech
multi-modal alignment
TTS