MINT-Bench: A Comprehensive Multilingual Benchmark for Instruction-Following Text-to-Speech

📅 2026-04-20

🤖 AI Summary
This work addresses the absence of a fine-grained, multilingual benchmark for instruction-following text-to-speech (TTS) systems by introducing MINT-Bench, the first evaluation framework for this task supporting ten languages. The framework integrates a hierarchical multi-axis taxonomy, a multi-stage data construction pipeline, and a hybrid evaluation protocol combining automatic metrics and human assessments to jointly measure content consistency, instruction adherence, and perceptual quality. Designed for diagnostic use, it exposes critical limitations in current systems, particularly in handling compound controls and paralinguistic expression. Evaluations show that leading commercial systems generally outperform others, though certain open-source models excel in specific languages such as Chinese. The project publicly releases its dataset, toolkit, leaderboard, and an interactive demo system to support further research and development in instruction-following TTS.

📝 Abstract
Instruction-following text-to-speech (TTS) has emerged as an important capability for controllable and expressive speech generation, yet its evaluation remains underdeveloped due to limited benchmark coverage, weak diagnostic granularity, and insufficient multilingual support. We present MINT-Bench, a comprehensive multilingual benchmark for instruction-following TTS. MINT-Bench is built upon a hierarchical multi-axis taxonomy, a scalable multi-stage data construction pipeline, and a hierarchical hybrid evaluation protocol that jointly assesses content consistency, instruction following, and perceptual quality. Experiments across ten languages show that the task remains far from solved: frontier commercial systems lead overall, while leading open-source models are highly competitive and can even outperform commercial counterparts in localized settings such as Chinese. The benchmark further reveals that harder compositional and paralinguistic controls remain major bottlenecks for current systems. We release MINT-Bench together with the data construction and evaluation toolkit to support future research on controllable, multilingual, and diagnostically grounded TTS evaluation. The leaderboard and demo are available at https://longwaytog0.github.io/MINT-Bench/
Problem

Research questions and friction points this paper is trying to address.

instruction-following TTS
multilingual benchmark
speech generation evaluation
controllable TTS
diagnostic evaluation
Innovation

Methods, ideas, or system contributions that make the work stand out.

instruction-following TTS
multilingual benchmark
hierarchical evaluation
controllable speech synthesis
diagnostic assessment
Authors
Huakang Chen
Audio, Speech and Language Processing Lab (ASLP@NPU), Northwestern Polytechnical University, Xi’an, China
Jingbin Hu
Audio, Speech and Language Processing Lab (ASLP@NPU), Northwestern Polytechnical University, Xi’an, China
Liumeng Xue
Hong Kong University of Science and Technology
Audio Speech and Language Processing, Speech Generation
Qirui Zhan
Audio, Speech and Language Processing Lab (ASLP@NPU), Northwestern Polytechnical University, Xi’an, China
Wenhao Li
Audio, Speech and Language Processing Lab (ASLP@NPU), Northwestern Polytechnical University, Xi’an, China
Guobin Ma
Northwestern Polytechnical University
Hanke Xie
Northwestern Polytechnical University
Audio speech synthesis
Dake Guo
Northwestern Polytechnical University
Speech Processing, Speech Synthesis
Linhan Ma
Audio, Speech and Language Processing Lab (ASLP@NPU), Northwestern Polytechnical University, Xi’an, China
Yuepeng Jiang
Northwestern Polytechnical University
Speech Processing, Speech Synthesis, Voice Conversion
Bengu Wu
Yutu Zhineng, Beijing, China
Pengyuan Xie
Lingguang Zhaxian Technology, Shanghai, China
Chuan Xie
Lingguang Zhaxian Technology, Shanghai, China
Qiang Zhang
University of Science and Technology of China
quantum information, quantum optics
Lei Xie
Northwestern Polytechnical University
speech processing, speech recognition, speech synthesis, multimedia, artificial intelligence