IndexTTS 2.5 Technical Report

📅 2026-01-07
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenges of limited language coverage, low inference efficiency, and difficulty in cross-lingual emotional transfer in zero-shot multilingual expressive speech synthesis. To this end, the authors propose an efficient and emotionally controllable text-to-speech (TTS) system that reduces semantic encoder-decoder frame rates, replaces the U-DiT architecture with Zipformer, and introduces three explicit cross-lingual modeling strategies. Furthermore, they employ GRPO-based reinforcement learning to optimize the generation process, achieving cross-lingual emotional transfer for the first time without any target-language emotional data. The system supports Chinese, English, Japanese, and Spanish, delivering a 2.28× speedup in inference while matching IndexTTS 2 in word error rate and speaker similarity, thereby significantly enhancing both synthesis quality and generalization capability.

📝 Abstract
In prior work, we introduced IndexTTS 2, a zero-shot neural text-to-speech foundation model comprising two core components: a transformer-based Text-to-Semantic (T2S) module and a non-autoregressive Semantic-to-Mel (S2M) module, which together enable faithful emotion replication and establish the first autoregressive duration-controllable generative paradigm. Building upon this, we present IndexTTS 2.5, which significantly enhances multilingual coverage, inference speed, and overall synthesis quality through four key improvements: 1) Semantic Codec Compression: we reduce the semantic codec frame rate from 50 Hz to 25 Hz, halving sequence length and substantially lowering both training and inference costs; 2) Architectural Upgrade: we replace the U-DiT-based backbone of the S2M module with a more efficient Zipformer-based modeling architecture, achieving notable parameter reduction and faster mel-spectrogram generation; 3) Multilingual Extension: we propose three explicit cross-lingual modeling strategies, boundary-aware alignment, token-level concatenation, and instruction-guided generation, establishing practical design principles for zero-shot multilingual emotional TTS that supports Chinese, English, Japanese, and Spanish, and enables robust emotion transfer even without target-language emotional training data; 4) Reinforcement Learning Optimization: we apply GRPO in post-training of the T2S module, improving pronunciation accuracy and naturalness. Experiments show that IndexTTS 2.5 not only supports broader language coverage but also replicates emotional prosody in unseen languages under the same zero-shot setting. IndexTTS 2.5 achieves a 2.28× improvement in real-time factor (RTF) while maintaining WER and speaker similarity comparable to IndexTTS 2.
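The abstract's frame-rate and RTF claims can be made concrete with a minimal sketch. The 50 Hz → 25 Hz rates and the 2.28× RTF figure come from the report; the 10-second clip and the example synthesis time are illustrative assumptions, not measurements:

```python
# Sketch of how halving the semantic codec frame rate halves sequence length,
# and what real-time factor (RTF) measures. Clip duration is an assumed example.

def semantic_tokens(audio_seconds: float, frame_rate_hz: int) -> int:
    """Number of semantic tokens the codec emits for a clip of given length."""
    return int(audio_seconds * frame_rate_hz)

# A hypothetical 10 s utterance: 50 Hz (IndexTTS 2) vs 25 Hz (IndexTTS 2.5).
old_len = semantic_tokens(10, 50)  # 500 tokens
new_len = semantic_tokens(10, 25)  # 250 tokens, half the sequence length

def rtf(synthesis_seconds: float, audio_seconds: float) -> float:
    """Real-time factor: wall-clock synthesis time / audio duration (lower is better).
    A 2.28x RTF improvement means synthesis time per audio second drops by that factor."""
    return synthesis_seconds / audio_seconds
```

Because autoregressive cost grows with sequence length, halving the token count reduces both training and inference cost beyond the raw 2× length saving.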
Problem

Research questions and friction points this paper is trying to address.

zero-shot TTS
multilingual speech synthesis
emotional prosody replication
efficient inference
text-to-speech
Innovation

Methods, ideas, or system contributions that make the work stand out.

Semantic Codec Compression
Zipformer Architecture
Zero-shot Multilingual TTS
Cross-lingual Emotion Transfer
Reinforcement Learning Optimization
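The reinforcement-learning contribution uses GRPO, whose core idea is to score several sampled candidates per prompt against each other rather than against a learned value function. A minimal sketch of that group-relative advantage step, with illustrative reward values (the actual reward design for the T2S module is not specified here):

```python
# Hedged sketch of GRPO's group-relative advantage computation: each sampled
# candidate's reward is normalized by its group's mean and standard deviation.
import statistics

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """Advantage of each candidate relative to its sampling group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against a zero-variance group
    return [(r - mean) / std for r in rewards]

# Example: rewards for 4 candidates sampled from the same text prompt
# (e.g., derived from pronunciation accuracy or a naturalness score).
advs = group_relative_advantages([0.9, 0.7, 0.8, 0.6])
```

These advantages then weight the policy-gradient update on the T2S module's token probabilities; candidates above the group mean are reinforced, those below are suppressed.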
Yunpei Li
Bilibili Inc.
Xun Zhou
Professor of Computer Science, Harbin Institute of Technology, Shenzhen (HIT-SZ)
Jinchao Wang
Bilibili Inc.
Lu Wang
Bilibili Inc.
Yong Wu
Bilibili Inc.
Siyi Zhou
Bilibili Inc.
Yiquan Zhou
Bilibili Inc.
Jingchen Shu
Bilibili Inc.