TTS-PRISM: A Perceptual Reasoning and Interpretable Speech Model for Fine-Grained Diagnosis

📅 2026-04-24

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses a limitation of existing Mandarin text-to-speech (TTS) evaluation methods: they rely predominantly on holistic metrics and struggle to diagnose fine-grained acoustic artifacts and perceptual degradation. To overcome this, we propose TTS-PRISM, a multidimensional diagnostic framework that integrates perceptual reasoning with interpretability. Using a schema of twelve acoustic-perceptual dimensions, we construct a high-quality diagnostic dataset by combining expert-defined anchors with adversarially perturbed samples. A schema-driven instruction-tuning strategy embeds explicit scoring criteria and reasoning into an end-to-end evaluation model. On a 1,600-sample gold-standard test set, the model outperforms general-purpose models in alignment with human ratings. Profiling six TTS paradigms yields intuitive diagnostic profiles that reveal fine-grained capability differences. Code and checkpoints are publicly released.

📝 Abstract

While generative text-to-speech (TTS) models approach human-level quality, monolithic metrics fail to diagnose fine-grained acoustic artifacts or explain perceptual collapse. To address this, we propose TTS-PRISM, a multi-dimensional diagnostic framework for Mandarin. First, we establish a 12-dimensional schema spanning stability to advanced expressiveness. Second, we design a targeted synthesis pipeline with adversarial perturbations and expert anchors to build a high-quality diagnostic dataset. Third, schema-driven instruction tuning embeds explicit scoring criteria and reasoning into an efficient end-to-end model. Experiments on a 1,600-sample Gold Test Set show TTS-PRISM outperforms generalist models in human alignment. Profiling six TTS paradigms establishes intuitive diagnostic flags that reveal fine-grained capability differences. TTS-PRISM is open-source, with code and checkpoints at https://github.com/xiaomi-research/tts-prism.
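
To make "human alignment" concrete, the following is a minimal sketch of how agreement between model scores and human ratings could be measured per diagnostic dimension. It assumes Spearman rank correlation as the statistic, which the abstract does not specify. The dimension names and all scores are toy placeholders, not the paper's schema or data, and the helper names are hypothetical.

```python
from statistics import mean


def rank(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; NaN when either input has zero variance."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return float("nan")
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman correlation, computed as Pearson correlation of the ranks."""
    return pearson(rank(x), rank(y))


def per_dimension_alignment(human, model):
    """Map each dimension name to the human-vs-model Spearman correlation."""
    return {dim: spearman(human[dim], model[dim]) for dim in human}


# Toy placeholder ratings for two illustrative dimensions (assumed names).
human_scores = {
    "stability": [1, 2, 3, 4, 5],
    "expressiveness": [2, 1, 4, 3, 5],
}
model_scores = {
    "stability": [1.2, 2.1, 2.9, 4.5, 4.8],
    "expressiveness": [5, 4, 3, 2, 1],
}

alignment = per_dimension_alignment(human_scores, model_scores)
for dim, rho in alignment.items():
    print(f"{dim}: {rho:+.2f}")
```

Reporting one correlation per dimension, rather than a single pooled number, mirrors the diagnostic intent described above: a model can track human judgments well on one dimension and poorly on another.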

Problem

Research questions and friction points this paper is trying to address.

text-to-speech

fine-grained diagnosis

perceptual evaluation

acoustic artifacts

speech quality assessment

Innovation

Methods, ideas, or system contributions that make the work stand out.

perceptual reasoning

interpretable speech model

fine-grained diagnosis