Length-Aware Rotary Position Embedding for Text-Speech Alignment

📅 2025-09-14

🤖 AI Summary

Transformer-based text-to-speech (TTS) systems commonly use Rotary Position Embedding (RoPE), which relies on absolute token indices and therefore struggles to model the alignment between text and speech, particularly when utterance duration varies or when generating long sequences. To address this, we propose Length-Aware Rotary Position Embedding (LARoPE), which computes relative query-key distances from length-normalized positional indices, making cross-modal alignment more robust. Integrated into the Transformer's cross-attention mechanism, LARoPE requires no additional parameters while improving stability in long-sequence modeling. Experiments show that LARoPE converges faster than RoPE, aligns text and speech more accurately, yields higher overall TTS quality, achieves a state-of-the-art word error rate on a standard zero-shot TTS benchmark, and maintains stable performance for speech up to 30 seconds long.

📝 Abstract

Many recent text-to-speech (TTS) systems are built on transformer architectures and employ cross-attention mechanisms for text-speech alignment. Within these systems, rotary position embedding (RoPE) is commonly used to encode positional information in text and speech representations. In this work, we introduce length-aware RoPE (LARoPE), a simple yet effective extension of RoPE that improves text-speech alignment. Unlike RoPE, which relies on absolute indices, LARoPE computes relative distances between query and key positions using length-normalized indices. Experimental results show that LARoPE consistently outperforms RoPE, offering faster loss convergence, more accurate text-speech alignment, and higher overall TTS quality. Furthermore, LARoPE demonstrates greater resilience to variations in utterance duration and maintains stable performance in extended speech generation up to 30 seconds, whereas RoPE suffers from notable degradation. Notably, our method achieves a state-of-the-art word error rate on a standard zero-shot TTS benchmark.
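
The idea of length-normalized indices can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the `scale` factor, the frequency `base`, and the choice to divide each index by its own sequence length are assumptions made for illustration only.

```python
import numpy as np

def rope_angles(positions, dim, base=10000.0):
    """Rotation angles for each position and each 2-D feature pair."""
    inv_freq = base ** (-np.arange(0, dim, 2) / dim)   # (dim/2,)
    return np.outer(positions, inv_freq)               # (len, dim/2)

def rotate(x, angles):
    """Apply the standard RoPE rotation to consecutive feature pairs of x."""
    x1, x2 = x[..., 0::2], x[..., 1::2]
    cos, sin = np.cos(angles), np.sin(angles)
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

def length_normalized_positions(length, scale=10.0):
    """Assumed form: index i becomes scale * i / length (hypothetical scale)."""
    return scale * np.arange(length) / length

def cross_attention_scores(q, k, scale=10.0):
    """Scaled dot-product scores with length-normalized rotary positions.

    q: (Lq, d) queries (e.g. speech frames); k: (Lk, d) keys (e.g. text tokens).
    """
    Lq, d = q.shape
    Lk, _ = k.shape
    q_rot = rotate(q, rope_angles(length_normalized_positions(Lq, scale), d))
    k_rot = rotate(k, rope_angles(length_normalized_positions(Lk, scale), d))
    return q_rot @ k_rot.T / np.sqrt(d)

# Usage: 8 query positions attending over 4 key positions.
q = np.ones((8, 16))
k = np.ones((4, 16))
scores = cross_attention_scores(q, k)
print(scores.shape)
```

With identical query and key vectors, entries where `i / Lq` equals `j / Lk` receive the same (maximal) score regardless of the two sequence lengths, which is the diagonal alignment bias that plain absolute indices do not give when the lengths differ.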

Problem

Research questions and friction points this paper is trying to address.

Improving text-speech alignment in TTS systems

Enhancing position encoding with length-normalized indices

Addressing performance degradation in long speech generation

Innovation

Methods, ideas, or system contributions that make the work stand out.

Length-aware RoPE extension for alignment

Relative distance computation with normalized indices

Improved resilience to utterance duration variations
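
Why normalized indices yield a relative, length-independent distance can be seen from the standard rotation identity of RoPE. The notation here is assumed for illustration and is not taken from the paper: $R(p)$ is the block-diagonal rotation at position $p$, $L_q$ and $L_k$ are the query and key sequence lengths, and $\gamma$ is a hypothetical scale factor.

```latex
% RoPE rotations compose additively: R(a)^\top R(b) = R(b - a).
% Substituting length-normalized positions a = \gamma i / L_q and b = \gamma j / L_k:
\left\langle R\!\left(\tfrac{\gamma i}{L_q}\right) q,\;
             R\!\left(\tfrac{\gamma j}{L_k}\right) k \right\rangle
  = \left\langle q,\;
      R\!\left(\gamma\left(\tfrac{j}{L_k} - \tfrac{i}{L_q}\right)\right) k
    \right\rangle
```

Under these assumptions the score depends only on the difference of relative positions $j/L_k - i/L_q$, which is zero along the diagonal $i/L_q = j/L_k$ for any pair of lengths.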
