Preferences of a Voice-First Nation: Large-Scale Pairwise Evaluation and Preference Analysis for TTS in Indian Languages

📅 2026-04-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the high variance and multidimensional perceptual complexity of evaluating multilingual text-to-speech (TTS) systems for Indian languages by proposing a controlled, multidimensional pairwise evaluation framework. In large-scale crowdsourced pairwise preference experiments across ten Indic languages, covering more than 5,000 native and code-mixed sentences, seven TTS systems, and over 120,000 comparisons from more than 1,900 native raters, the study combines perceptual annotations, Bradley-Terry modeling, and SHAP-based interpretability analysis. The analysis exposes the strengths and trade-offs of TTS models across six perceptual dimensions: intelligibility, expressiveness, voice quality, liveliness, noise, and hallucinations. The project builds a human preference leaderboard for multilingual Indic TTS, analyzes its reliability, and provides an interpretable, fine-grained benchmark for future research in multilingual speech synthesis.

📝 Abstract
Crowdsourced pairwise evaluation has emerged as a scalable approach for assessing foundation models. However, applying it to text-to-speech (TTS) introduces high variance due to linguistic diversity and the multidimensional nature of speech perception. We present a controlled multidimensional pairwise evaluation framework for multilingual TTS that combines linguistic control with perceptually grounded annotation. Using 5K+ native and code-mixed sentences across 10 Indic languages, we evaluate 7 state-of-the-art TTS systems and collect over 120K pairwise comparisons from over 1,900 native raters. In addition to overall preference, raters provide judgments across 6 perceptual dimensions: intelligibility, expressiveness, voice quality, liveliness, noise, and hallucinations. Using Bradley-Terry modeling, we construct a multilingual leaderboard, interpret human preference using SHAP analysis, and analyze leaderboard reliability alongside model strengths and trade-offs across perceptual dimensions.
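To make the leaderboard step concrete, here is a minimal sketch of fitting a Bradley-Terry model to pairwise preferences, where the probability that system i is preferred over system j is p_i / (p_i + p_j). It uses the standard minorization-maximization (MM) update and is not the authors' implementation: the function names `fit_bradley_terry` and `win_probability`, the system labels, and the toy counts are all hypothetical, and ties and per-dimension judgments are ignored.

```python
from collections import defaultdict


def fit_bradley_terry(comparisons, n_iter=500, tol=1e-10):
    """Estimate Bradley-Terry strengths from (winner, loser) pairs.

    Uses the classic MM update p_i <- W_i / sum_j n_ij / (p_i + p_j),
    where W_i is the number of wins of i and n_ij the number of
    comparisons between i and j. Strengths are rescaled to average 1.
    """
    wins = defaultdict(float)
    pair_counts = defaultdict(float)
    systems = set()
    for winner, loser in comparisons:
        wins[winner] += 1.0
        pair_counts[frozenset((winner, loser))] += 1.0
        systems.update((winner, loser))
    if not systems:
        return {}

    strengths = {s: 1.0 for s in systems}
    for _ in range(n_iter):
        updated = {}
        for i in systems:
            denom = 0.0
            for j in systems:
                if j == i:
                    continue
                n_ij = pair_counts.get(frozenset((i, j)), 0.0)
                if n_ij:
                    denom += n_ij / (strengths[i] + strengths[j])
            updated[i] = wins[i] / denom if denom else strengths[i]
        # Strengths are only identified up to scale, so fix the mean at 1.
        total = sum(updated.values())
        updated = {s: v * len(systems) / total for s, v in updated.items()}
        delta = max(abs(updated[s] - strengths[s]) for s in systems)
        strengths = updated
        if delta < tol:
            break
    return strengths


def win_probability(strengths, a, b):
    """Modelled probability that system a is preferred over system b."""
    return strengths[a] / (strengths[a] + strengths[b])


# Toy data (hypothetical): each tuple is one rater judgment (winner, loser).
toy = (
    [("sys_A", "sys_B")] * 8 + [("sys_B", "sys_A")] * 2
    + [("sys_B", "sys_C")] * 7 + [("sys_C", "sys_B")] * 3
    + [("sys_A", "sys_C")] * 9 + [("sys_C", "sys_A")] * 1
)
scores = fit_bradley_terry(toy)
leaderboard = sorted(scores, key=scores.get, reverse=True)
print(leaderboard)
```

Sorting systems by fitted strength gives the ranking. In practice one would typically also fit the model per language and per perceptual dimension and attach uncertainty estimates, but those choices are assumptions here rather than details taken from the abstract.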
Problem

Research questions and friction points this paper is trying to address.

Text-to-Speech
pairwise evaluation
multilingual TTS
speech perception
Indian languages
Innovation

Methods, ideas, or system contributions that make the work stand out.

multidimensional pairwise evaluation
multilingual TTS
perceptual annotation
Bradley-Terry modeling
SHAP analysis
Authors
Srija Anand
MS by Research, AI4Bharat, IIT Madras
Speech Synthesis, Natural Language Processing, LLM Evaluation
Ashwin Sankar
MS by Research @ IIT Madras, AI4Bharat
Speech Synthesis, Speech Translation, Multi-modal AI, TTS, LLM
Ishvinder Sethi
AI4Bharat, India
Aaditya Pareek
Josh Talks, India
Kartik Rajput
AI4Bharat, India
Gaurav Yadav
Amity University
Power Systems, Power Electronics
Nikhil Narasimhan
AI4Bharat, India
Adish Pandya
Indian Institute of Technology, Madras, India; AI4Bharat, India
Deepon Halder
Researcher
AI
Mohammed Safi Ur Rahman Khan
PhD @ IIT Madras, AI4Bharat, Wadhwani School of Data Science and AI
Multimodal Language Models, Large Language Models, Natural Language Processing, Language Modelling
Praveen S V
Indian Institute of Technology, Madras, India; AI4Bharat, India
Shobhit Banga
Josh Talks, India
Mitesh M Khapra
Indian Institute of Technology, Madras, India; AI4Bharat, India