🤖 AI Summary
Word error rate (WER) collapses distinct error types into a single score and over-penalizes the valid sandhi merges and agglutination common in Indic languages, so it does not reflect how usable a speech recognition system's output actually is. To address this limitation, this work proposes SCRIBE, a diagnostic framework that decomposes recognition errors into lexical, punctuation, numeral, and domain-entity rates through sandhi-tolerant alignment and domain vocabulary injection. SCRIBE provides fine-grained ASR evaluation metrics for Hindi, Malayalam, and Kannada that align with expert judgment where WER does not. The authors release the diagnostic framework, an LLM-based data curation pipeline, benchmarks, and open-weight rich transcription models for the three languages.
📝 Abstract
Automatic speech recognition replaces typing only when correction costs less than manual entry, a threshold determined by error types, not counts: fixing a misrecognized domain term costs far more than inserting a comma. Word error rate (WER) fails on two fronts: it collapses distinct error categories into a single scalar, and it structurally penalizes agglutinative languages where valid sandhi merges inflate scores. We introduce SCRIBE, a diagnostic framework that provides categorical error decomposition into lexical, punctuation, numeral, and domain-entity rates through sandhi-tolerant alignment with domain vocabulary injection. Human validation confirms SCRIBE aligns with expert judgment where WER does not. We release SCRIBE, an LLM curation pipeline, benchmarks, and open-weight rich transcription models for Hindi, Malayalam, and Kannada.
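The two failure modes of WER described above can be illustrated with a minimal sketch. This is not SCRIBE's implementation: the function names `wer` and `categorize`, the `merge_tolerant` flag, and the placeholder tokens are all hypothetical, and the merge tolerance here only forgives pure concatenation of adjacent tokens, whereas real sandhi also changes sounds at the boundary. The category rules are likewise a rough assumption based on Unicode character classes and a caller-supplied vocabulary.

```python
import unicodedata


def wer(ref: str, hyp: str, merge_tolerant: bool = False) -> float:
    """Word-level edit distance divided by the number of reference words."""
    r, h = ref.split(), hyp.split()
    # d[i][j] = minimum edit cost to turn r[:i] into h[:j]
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = 0 if r[i - 1] == h[j - 1] else 1
            best = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + sub)
            if merge_tolerant:
                # Assumption: two reference tokens written as one hypothesis
                # token (or the reverse) count as a valid merge at zero cost.
                if i >= 2 and r[i - 2] + r[i - 1] == h[j - 1]:
                    best = min(best, d[i - 2][j - 1])
                if j >= 2 and h[j - 2] + h[j - 1] == r[i - 1]:
                    best = min(best, d[i - 1][j - 2])
            d[i][j] = best
    return d[len(r)][len(h)] / max(len(r), 1)


def categorize(token: str, domain_vocab: frozenset = frozenset()) -> str:
    """Assign a token to one of four illustrative error categories."""
    if token in domain_vocab:
        return "domain_entity"
    # Unicode general categories starting with "P" are punctuation,
    # which covers the Devanagari danda as well as ASCII marks.
    if token and all(unicodedata.category(ch).startswith("P") for ch in token):
        return "punctuation"
    if any(ch.isdigit() for ch in token):
        return "numeral"
    return "lexical"


ref = "ab cd ef gh"
hyp = "ab cdef gh"  # "cd" and "ef" merged into one token
print(wer(ref, hyp))                       # 0.5
print(wer(ref, hyp, merge_tolerant=True))  # 0.0
```

In this toy case a single merged token costs plain WER one substitution plus one deletion, while the merge-tolerant alignment scores it as correct. Reporting per-category rates (for example, by applying `categorize` to each misaligned reference token) would then separate a cheap punctuation fix from a costly domain-term fix, which a single scalar cannot do.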