🤖 AI Summary
Transformer-based text-to-speech (TTS) systems commonly employ Rotary Position Embedding (RoPE) to encode positions in text and speech representations. Because RoPE relies on absolute token indices, it struggles to model the alignment between text and speech, particularly when speech duration varies or sequences are long, and performance degrades notably. To address this, we propose Length-Aware Rotary Position Embedding (LARoPE), which computes relative query-key distances from length-normalized positional indices and thereby makes cross-modal alignment more robust. Integrated into the Transformer's cross-attention mechanism, LARoPE requires no additional parameters while improving stability in long-sequence modeling. Experiments show that LARoPE converges faster than RoPE, improves alignment accuracy and overall TTS quality, achieves a state-of-the-art word error rate on a standard zero-shot TTS benchmark, and maintains stable performance for speech up to 30 seconds in duration.
📝 Abstract
Many recent text-to-speech (TTS) systems are built on transformer architectures and employ cross-attention mechanisms for text-speech alignment. Within these systems, rotary position embedding (RoPE) is commonly used to encode positional information in text and speech representations. In this work, we introduce length-aware RoPE (LARoPE), a simple yet effective extension of RoPE that improves text-speech alignment. Unlike RoPE, which relies on absolute indices, LARoPE computes relative distances between query and key positions using length-normalized indices. Experimental results show that LARoPE consistently outperforms RoPE, offering faster loss convergence, more accurate text-speech alignment, and higher overall TTS quality. Furthermore, LARoPE demonstrates greater resilience to variations in utterance duration and maintains stable performance in extended speech generation up to 30 seconds, whereas RoPE suffers from notable degradation. Notably, our method achieves a state-of-the-art word error rate on a standard zero-shot TTS benchmark.
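The abstract describes length-normalized indices without giving a formula, so the following is only a minimal NumPy sketch of the idea, not the authors' implementation. It assumes that each position index is divided by its own sequence length and multiplied by a constant before the usual rotary rotation is applied. The function names, the `scale` value of 10, and the all-ones toy inputs are illustrative assumptions.

```python
import numpy as np


def rope_angles(positions, dim, base=10000.0):
    """Rotation angles of shape (T, dim // 2) for the given (possibly fractional) positions."""
    inv_freq = base ** (-np.arange(0, dim, 2) / dim)
    return np.outer(positions, inv_freq)


def rotate(x, angles):
    """Apply the rotary rotation to consecutive feature pairs of x, shape (T, dim)."""
    x_even, x_odd = x[:, 0::2], x[:, 1::2]
    cos, sin = np.cos(angles), np.sin(angles)
    out = np.empty_like(x)
    out[:, 0::2] = x_even * cos - x_odd * sin
    out[:, 1::2] = x_even * sin + x_odd * cos
    return out


def positions_for(length, length_aware, scale=10.0):
    """Absolute indices (plain RoPE) or length-normalized indices (assumed LARoPE form)."""
    idx = np.arange(length, dtype=float)
    # `scale` is a hypothetical constant; the source does not state one.
    return scale * idx / length if length_aware else idx


def cross_attention_scores(q, k, length_aware, scale=10.0):
    """Pre-softmax cross-attention scores with rotary positions on queries and keys."""
    dim = q.shape[1]
    q_pos = positions_for(q.shape[0], length_aware, scale)
    k_pos = positions_for(k.shape[0], length_aware, scale)
    q_rot = rotate(q, rope_angles(q_pos, dim))
    k_rot = rotate(k, rope_angles(k_pos, dim))
    return q_rot @ k_rot.T / np.sqrt(dim)


# Toy setup: 200 speech frames (queries) attend to 20 text tokens (keys).
# Identical content vectors isolate the purely positional part of the score.
n_frames, n_tokens, dim = 200, 20, 64
q = np.ones((n_frames, dim))
k = np.ones((n_tokens, dim))

rope_scores = cross_attention_scores(q, k, length_aware=False)
lar_scores = cross_attention_scores(q, k, length_aware=True)

# Text token preferred by each speech frame on positional grounds alone.
rope_path = rope_scores.argmax(axis=1)
lar_path = lar_scores.argmax(axis=1)
```

In this toy setting the length-normalized variant makes each speech frame prefer the text token at the same relative position in its sequence, so `lar_path` follows the diagonal from the first token to the last. With absolute indices, frame `i` prefers token `i` for the first 20 frames regardless of how long the utterance is. This illustrates why the positional bias of plain RoPE does not adapt to utterance duration.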