TTS-PRISM: A Perceptual Reasoning and Interpretable Speech Model for Fine-Grained Diagnosis

📅 2026-04-24
📈 Citations: 0
✨ Influential: 0

🤖 AI Summary
Existing evaluation methods for Mandarin text-to-speech (TTS) rely predominantly on holistic metrics and struggle to diagnose fine-grained acoustic artifacts and perceptual degradations. This work proposes TTS-PRISM, a multidimensional diagnostic framework that integrates perceptual reasoning with interpretability. It defines twelve acoustic-perceptual dimensions and constructs high-quality diagnostic data by combining expert-defined anchors with adversarially perturbed samples. A schema-driven instruction-tuning strategy embeds human rating logic into an end-to-end evaluation model. On a 1,600-sample gold-standard test set, the model outperforms general-purpose approaches in correlation with human ratings, and profiling six major TTS paradigms yields intuitive diagnostic profiles that reveal their nuanced performance differences. Code and models are publicly released.

📝 Abstract
While generative text-to-speech (TTS) models approach human-level quality, monolithic metrics fail to diagnose fine-grained acoustic artifacts or explain perceptual collapse. To address this, we propose TTS-PRISM, a multi-dimensional diagnostic framework for Mandarin. First, we establish a 12-dimensional schema spanning stability to advanced expressiveness. Second, we design a targeted synthesis pipeline with adversarial perturbations and expert anchors to build a high-quality diagnostic dataset. Third, schema-driven instruction tuning embeds explicit scoring criteria and reasoning into an efficient end-to-end model. Experiments on a 1,600-sample Gold Test Set show TTS-PRISM outperforms generalist models in human alignment. Profiling six TTS paradigms establishes intuitive diagnostic flags that reveal fine-grained capability differences. TTS-PRISM is open-source, with code and checkpoints at https://github.com/xiaomi-research/tts-prism.
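The abstract reports alignment with human ratings on the Gold Test Set but does not state which statistic is used. As a minimal sketch, assuming per-dimension Pearson correlation between human and model scores, the comparison could look like the following. The dimension names `stability` and `expressiveness`, the toy scores, and the helper `per_dimension_alignment` are hypothetical illustrations and are not taken from the TTS-PRISM repository.

```python
from math import sqrt
from typing import Dict, List


def pearson(xs: List[float], ys: List[float]) -> float:
    """Pearson correlation between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / sqrt(var_x * var_y)


def per_dimension_alignment(
    human: Dict[str, List[float]], model: Dict[str, List[float]]
) -> Dict[str, float]:
    """Correlate model scores with human scores separately for each dimension."""
    return {dim: pearson(human[dim], model[dim]) for dim in human}


# Hypothetical toy ratings for two illustrative dimensions (assumed names).
human = {
    "stability": [1.0, 2.0, 3.0, 4.0],
    "expressiveness": [4.0, 3.0, 2.0, 1.0],
}
model = {
    "stability": [2.0, 4.0, 6.0, 8.0],
    "expressiveness": [1.0, 2.0, 3.0, 4.0],
}

alignment = per_dimension_alignment(human, model)
for dim, r in alignment.items():
    print(f"{dim}: r = {r:.2f}")
```

Reporting one coefficient per dimension, rather than a single pooled score, matches the fine-grained diagnostic framing: a model can track human judgments well on one dimension and poorly on another.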

Problem

Research questions and friction points this paper is trying to address.

text-to-speech
fine-grained diagnosis
perceptual evaluation
acoustic artifacts
speech quality assessment

Innovation

Methods, ideas, or system contributions that make the work stand out.

perceptual reasoning
interpretable speech model
fine-grained diagnosis
instruction tuning
adversarial perturbation