AI Summary
This work addresses a limitation of existing Mandarin Chinese text-to-speech (TTS) evaluation methods: they rely predominantly on holistic metrics and therefore struggle to diagnose fine-grained acoustic artifacts and perceptual degradations. To overcome this, the authors propose TTS-PRISM, a multidimensional diagnostic framework that combines perceptual reasoning with interpretability. Based on a schema of twelve acoustic-perceptual dimensions, they construct high-quality diagnostic data by combining expert-defined anchors with adversarially perturbed samples. A schema-driven instruction-tuning strategy then embeds explicit human scoring criteria and reasoning into an end-to-end evaluation model. On a 1,600-sample gold-standard test set, the model aligns more closely with human ratings than general-purpose models do. Applied to six major TTS paradigms, it produces intuitive diagnostic profiles that reveal fine-grained capability differences among them. Code and models are publicly released.
Abstract
While generative text-to-speech (TTS) models approach human-level quality, monolithic metrics fail to diagnose fine-grained acoustic artifacts or explain perceptual collapse. To address this, we propose TTS-PRISM, a multi-dimensional diagnostic framework for Mandarin. First, we establish a 12-dimensional schema spanning stability to advanced expressiveness. Second, we design a targeted synthesis pipeline with adversarial perturbations and expert anchors to build a high-quality diagnostic dataset. Third, schema-driven instruction tuning embeds explicit scoring criteria and reasoning into an efficient end-to-end model. Experiments on a 1,600-sample Gold Test Set show TTS-PRISM outperforms generalist models in human alignment. Profiling six TTS paradigms establishes intuitive diagnostic flags that reveal fine-grained capability differences. TTS-PRISM is open-source, with code and checkpoints at https://github.com/xiaomi-research/tts-prism.
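As an illustration of what "human alignment" on a multi-dimensional schema can mean in practice, the following minimal sketch computes a per-dimension Spearman rank correlation between human ratings and model scores. This is an assumption for illustration only: the abstract does not state which alignment metric TTS-PRISM reports, and the function names and the dimension keys `dim_a` and `dim_b` are hypothetical placeholders, not names from the paper or its repository.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to cover the run of values tied with values[order[i]].
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; returns NaN if either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if std_x == 0 or std_y == 0:
        return float("nan")
    return cov / (std_x * std_y)


def spearman(x, y):
    """Spearman correlation = Pearson correlation of the (tie-averaged) ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    return pearson(average_ranks(x), average_ranks(y))


def per_dimension_alignment(human, model):
    """Map each dimension name to the Spearman correlation of its scores.

    `human` and `model` are dicts: dimension name -> list of scores,
    with samples in the same order in both.
    """
    return {dim: spearman(human[dim], model[dim]) for dim in human}


# Hypothetical toy data: two placeholder dimensions, four samples each.
human_scores = {"dim_a": [1, 2, 3, 4], "dim_b": [4, 3, 2, 1]}
model_scores = {"dim_a": [1.2, 2.1, 2.9, 3.8], "dim_b": [1.0, 2.0, 3.0, 4.0]}
alignment = per_dimension_alignment(human_scores, model_scores)
```

Reporting one coefficient per dimension, rather than a single pooled number, matches the diagnostic intent described above: a model can track human judgments well on one dimension and poorly on another, and a pooled score would hide that.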