🤖 AI Summary
This work proposes CAST-TTS, a novel text-to-speech (TTS) framework that unifies voice and textual prompts for speaker control within a single, streamlined architecture. Unlike conventional TTS systems that employ separate modules for processing speech and text cues, leading to fragmented modeling, CAST-TTS integrates both prompt types through a unified cross-attention mechanism. It leverages pre-trained encoders to extract features from speech and text inputs and aligns their representations in a shared embedding space via a multi-stage training strategy. This approach eliminates the need for complex, specialized components while maintaining architectural simplicity. Experimental results demonstrate that CAST-TTS achieves speech synthesis quality on par with dedicated single-input models, thereby validating the efficacy of unified speaker control in end-to-end TTS systems.
📝 Abstract
Current Text-to-Speech (TTS) systems typically use separate models for speech-prompted and text-prompted timbre control. While unifying both control signals into a single model is desirable, the challenge of cross-modal alignment often results in overly complex architectures and training objectives. To address this challenge, we propose CAST-TTS, a simple yet effective framework for unified timbre control. Features are extracted from speech prompts and text prompts using pre-trained encoders. A multi-stage training strategy efficiently aligns the speech and projected text representations within a shared embedding space. A single cross-attention mechanism then allows the model to use either of these representations to control the timbre. Extensive experiments validate that the unified cross-attention mechanism is critical for achieving high-quality synthesis. CAST-TTS achieves performance comparable to specialized single-input models while operating within a unified architecture. The demo page can be accessed at https://HiRookie9.github.io/CAST-TTS-Page.
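The key idea in the abstract, one cross-attention block that conditions on either a speech-prompt or a text-prompt embedding once both live in a shared space, can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: the shapes, projection matrices, and variable names (`decoder_states`, `speech_prompt`, `text_prompt`) are all hypothetical stand-ins for the pre-trained encoder outputs described above.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, prompt, Wq, Wk, Wv):
    """Single-head cross-attention: queries attend over prompt embeddings.

    queries: (T, d) TTS decoder states
    prompt:  (S, d) prompt embeddings (speech or projected text) in the shared space
    """
    Q, K, V = queries @ Wq, prompt @ Wk, prompt @ Wv
    scores = Q @ K.T / np.sqrt(Q.shape[-1])   # (T, S) scaled dot-product scores
    return softmax(scores, axis=-1) @ V       # (T, d) timbre-conditioned output

rng = np.random.default_rng(0)
d = 8
Wq, Wk, Wv = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))
decoder_states = rng.standard_normal((5, d))

# Hypothetical prompt embeddings: because both modalities are aligned into one
# shared embedding space, the SAME attention parameters serve either input.
speech_prompt = rng.standard_normal((12, d))  # e.g. frames from a speech encoder
text_prompt = rng.standard_normal((7, d))     # e.g. projected text-description tokens

out_speech = cross_attention(decoder_states, speech_prompt, Wq, Wk, Wv)
out_text = cross_attention(decoder_states, text_prompt, Wq, Wk, Wv)
print(out_speech.shape, out_text.shape)  # both (5, 8): one mechanism, two prompt types
```

The point of the sketch is that nothing in `cross_attention` depends on which modality produced `prompt`; the alignment stage does the modality-specific work, so the synthesis model needs no duplicated branches.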