GLM-TTS Technical Report

📅 2025-12-16

🤖 AI Summary
To meet the combined production requirements of efficiency, controllability, and high fidelity in text-to-speech (TTS), this paper proposes GLM-TTS, a two-stage end-to-end speech synthesis framework that pairs an autoregressive text-to-acoustic-token model with a diffusion-based token-to-waveform model. Methodologically, it introduces: (1) a pitch-constrained speech tokenizer that improves fundamental-frequency (F0) modeling accuracy; (2) a GRPO-based multi-reward reinforcement learning framework that jointly optimizes naturalness, speaker similarity, and controllability via heterogeneous rewards; and (3) lightweight LoRA-based voice customization together with a hybrid phoneme-text input scheme for fine-grained pronunciation control. Trained on 100K hours of data, GLM-TTS achieves state-of-the-art performance among open-source TTS systems. It is deployed for real-time synthesis on the Z.ai and Qingyan platforms, with live demonstrations available.
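The multi-reward RL idea above can be made concrete with a small sketch. In GRPO-style training, several utterances are sampled per prompt, each is scored by multiple reward models, and advantages are computed relative to the group. The reward names, weights, and numbers below are illustrative assumptions, not values from the paper:

```python
# Hypothetical sketch of GRPO-style group-relative advantages combining
# heterogeneous rewards (naturalness, speaker similarity, controllability)
# into one scalar per sampled utterance, then normalizing within the group.
from statistics import mean, pstdev

def group_advantages(rewards_per_sample, weights):
    """rewards_per_sample: one reward dict per sampled utterance in a
    GRPO group; weights: reward name -> mixing weight."""
    # Weighted sum of heterogeneous rewards -> one scalar per sample.
    scalars = [sum(weights[k] * r[k] for k in weights) for r in rewards_per_sample]
    # Group-relative normalization: subtract the group mean, divide by std.
    mu, sigma = mean(scalars), pstdev(scalars)
    return [(s - mu) / (sigma + 1e-8) for s in scalars]

group = [
    {"naturalness": 0.9, "speaker_sim": 0.8, "control": 0.7},
    {"naturalness": 0.6, "speaker_sim": 0.9, "control": 0.5},
    {"naturalness": 0.8, "speaker_sim": 0.7, "control": 0.9},
]
w = {"naturalness": 0.4, "speaker_sim": 0.4, "control": 0.2}
adv = group_advantages(group, w)
```

The resulting advantages are zero-mean within each group, so samples are pushed toward or away from the policy only relative to their siblings, which is the core of the group-relative scheme.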

📝 Abstract
This work proposes GLM-TTS, a production-level TTS system designed for efficiency, controllability, and high-fidelity speech generation. GLM-TTS follows a two-stage architecture, consisting of a text-to-token autoregressive model and a token-to-waveform diffusion model. With only 100k hours of training data, GLM-TTS achieves state-of-the-art performance on multiple open-source benchmarks. To meet production requirements, GLM-TTS improves speech quality through an optimized speech tokenizer with fundamental frequency constraints and a GRPO-based multi-reward reinforcement learning framework that jointly optimizes pronunciation, speaker similarity, and expressive prosody. In parallel, the system enables efficient and controllable deployment via parameter-efficient LoRA-based voice customization and a hybrid phoneme-text input scheme that provides precise pronunciation control. Our code is available at https://github.com/zai-org/GLM-TTS. Real-time speech synthesis demos are provided via Z.ai (audio.z.ai) and the Zhipu Qingyan app/web (chatglm.cn).
Problem

Research questions and friction points this paper is trying to address.

Develops a production-level TTS system for efficient, controllable, high-fidelity speech generation
Improves speech quality with an optimized tokenizer and multi-reward reinforcement learning
Enables efficient voice customization and precise pronunciation control for deployment
Innovation

Methods, ideas, or system contributions that make the work stand out.

Two-stage autoregressive and diffusion model architecture
Optimized tokenizer with F0 constraints and GRPO reinforcement learning
LoRA-based voice customization and hybrid phoneme-text input
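The hybrid phoneme-text input listed above can be illustrated with a minimal sketch: words carrying an inline phoneme override are tokenized as phoneme symbols, while the rest of the sentence stays as plain text. The `{word|PH1 PH2}` marker syntax and the `hybrid_tokenize` helper are assumptions for illustration, not GLM-TTS's actual input format:

```python
# Hypothetical sketch of a hybrid phoneme-text front end: segments marked
# "{word|PH1 PH2 ...}" are emitted as phoneme tokens (fixing pronunciation
# explicitly), all other words are emitted as plain text tokens.
import re

def hybrid_tokenize(text):
    tokens = []
    # Split on the override markers while keeping them in the stream.
    for piece in re.split(r"(\{[^}]*\})", text):
        m = re.fullmatch(r"\{([^|]*)\|([^}]*)\}", piece)
        if m:
            # Marked word: use the given phoneme sequence verbatim.
            tokens += [("phoneme", p) for p in m.group(2).split()]
        else:
            # Unmarked span: ordinary text tokens.
            tokens += [("text", w) for w in piece.split()]
    return tokens

print(hybrid_tokenize("read the {record|R EH1 K ER0 D} now"))
```

This kind of mixed stream lets the model fall back to its learned grapheme-to-phoneme behavior everywhere except the spans a user explicitly pins down, which is the practical appeal of hybrid input for pronunciation control.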
Jiayan Cui (Zhipu AI)
Zhihan Yang (Zhipu AI)
Naihan Li (Zhipu AI)
Jiankun Tian (Zhipu AI)
Xingyu Ma (Zhipu AI)
Yi Zhang (Zhipu AI)
Guangyu Chen (Zhipu AI)
Runxuan Yang (Zhipu AI)
Yuqing Cheng (Zhipu AI)
Yizhi Zhou (George Mason University)
Guochen Yu (Zhipu AI)
Xiaotao Gu (Zhipu AI)
Jie Tang (UW Madison)