🤖 AI Summary
This work addresses the limitations of existing speech recognition benchmarks for Indian languages, which rely predominantly on clean, read-style speech and on word error rate (WER) computed against a single reference transcript. Single-reference WER fails to account for common orthographic variation, especially the non-standardized spellings of code-mixed English-origin words. To bridge this gap, the authors introduce Voice of India, the first large-scale, real-world Indian speech benchmark, comprising 306,230 utterances (536 hours) from 36,691 speakers across 139 district clusters in 15 major Indian languages. The dataset consists of spontaneous telephone conversations and uses multi-reference transcriptions to accommodate spelling variability. The benchmark enables fine-grained performance analysis by district, audio quality, speaking rate, gender, and device type, and it reveals substantial disparities in current ASR systems across demographic, geographic, and acoustic conditions.
📝 Abstract
Existing Indic ASR benchmarks often use scripted, clean speech and leaderboard-driven evaluation that encourages dataset-specific overfitting. In addition, strict single-reference WER penalizes natural spelling variation in Indian languages, including non-standardized spellings of code-mixed English-origin words. To address these limitations, we introduce Voice of India, a closed-source benchmark built from unscripted telephonic conversations covering 15 major Indian languages across 139 regional clusters. The dataset contains 306,230 utterances, totaling 536 hours of speech from 36,691 speakers, with transcripts that account for spelling variations. We also analyze performance geographically at the district level, revealing disparities across regions. Finally, we provide detailed analysis across factors such as audio quality, speaking rate, gender, and device type, highlighting where current ASR systems struggle and offering insights for improving real-world Indic ASR systems.
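To illustrate why single-reference WER penalizes spelling variation, the following is a minimal sketch of one possible multi-reference scoring scheme: compute word-level WER against each acceptable transcript and keep the lowest score. The abstract does not specify how the benchmark scores against its spelling variants, so the minimum-over-references rule, the function names, and the romanized example sentences are all assumptions made for illustration.

```python
from typing import Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Word-level Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Standard WER: edit distance divided by reference length."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        return 0.0 if not hyp_words else 1.0
    return edit_distance(ref_words, hyp_words) / len(ref_words)


def multi_reference_wer(references: Sequence[str], hypothesis: str) -> float:
    """Assumed scheme: score against every valid reference, keep the best."""
    if not references:
        raise ValueError("at least one reference transcript is required")
    return min(wer(ref, hypothesis) for ref in references)


# Hypothetical example: two accepted spellings of an English-origin word.
references = [
    "mujhe doctor se milna hai",
    "mujhe daktar se milna hai",
]
hypothesis = "mujhe daktar se milna hai"

print(wer(references[0], hypothesis))               # single reference
print(multi_reference_wer(references, hypothesis))  # multi-reference
```

Under this assumed scheme, a hypothesis that matches any accepted spelling is not counted as an error, whereas scoring against only the first reference would count the variant spelling as a substitution.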