🤖 AI Summary
This work addresses the lack of a fine-grained, multilingual benchmark for instruction-following text-to-speech (TTS) systems by introducing an evaluation framework that covers ten languages. The framework combines a hierarchical multi-axis taxonomy, a multi-stage data construction pipeline, and a hybrid evaluation protocol that jointly measures content consistency, instruction adherence, and perceptual quality. It is designed to be diagnostic, and it exposes key limitations of current systems, particularly in handling compound (compositional) controls and paralinguistic expression. Evaluations show that leading commercial systems perform best overall, while leading open-source models are highly competitive and can outperform commercial systems in specific languages such as Chinese. The project releases its dataset, data construction and evaluation toolkit, leaderboard, and demo to support further research on instruction-following TTS.
📝 Abstract
Instruction-following text-to-speech (TTS) has emerged as an important capability for controllable and expressive speech generation, yet its evaluation remains underdeveloped due to limited benchmark coverage, weak diagnostic granularity, and insufficient multilingual support. We present MINT-Bench, a comprehensive multilingual benchmark for instruction-following TTS. MINT-Bench is built upon a hierarchical multi-axis taxonomy, a scalable multi-stage data construction pipeline, and a hierarchical hybrid evaluation protocol that jointly assesses content consistency, instruction following, and perceptual quality. Experiments across ten languages show that instruction-following TTS remains far from solved: frontier commercial systems lead overall, while leading open-source models are highly competitive and can even outperform their commercial counterparts in localized settings such as Chinese. The benchmark further reveals that harder compositional and paralinguistic controls remain major bottlenecks for current systems. We release MINT-Bench together with the data construction and evaluation toolkit to support future research on controllable, multilingual, and diagnostically grounded TTS evaluation. The leaderboard and demo are available at https://longwaytog0.github.io/MINT-Bench/