Vividh-ASR: A Complexity-Tiered Benchmark and Optimization Dynamics for Robust Indic Speech Recognition

📅 2026-05-13

🤖 AI Summary

This study addresses “studio bias”: fine-tuning multilingual automatic speech recognition (ASR) models on low-resource languages often improves read speech while degrading performance on spontaneous speech. To diagnose the mismatch, the authors introduce Vividh-ASR, a complexity-tiered benchmark for Hindi and Malayalam with four tiers: studio, broadcast, spontaneous, and synthetic noise. A controlled study of learning-rate timing and curriculum ordering finds that large parameter updates early in training improve global word error rate (WER) by 12 absolute points, and that a hard-to-easy curriculum adds further gains on spontaneous speech. These findings motivate Reverse Multi-Stage Fine-Tuning (R-MFT), a training recipe with which a parameter-efficient 244M Whisper model matches or exceeds conventionally fine-tuned 769M models. CKA and SVD-based representational analyses indicate that effective schedules concentrate adaptation in the decoder and preserve the pre-trained encoder's acoustic geometry. Both the benchmark and the models are publicly released.
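The 12-point figure is an absolute change in WER, not a relative one. As a minimal sketch of what that means, the snippet below computes WER as word-level edit distance divided by reference length, then expresses a change between two WER values in absolute points. The function names and the example numbers (0.40 and 0.28) are illustrative assumptions and are not taken from the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


def absolute_point_change(wer_before: float, wer_after: float) -> float:
    """Difference between two WER fractions, expressed in percentage points."""
    return (wer_before - wer_after) * 100


# Hypothetical values for illustration only (not results from the paper):
# going from 40% WER to 28% WER is a 12-point absolute improvement,
# which corresponds to a 30% relative reduction.
improvement = absolute_point_change(0.40, 0.28)
```

Reporting absolute points keeps improvements comparable across tiers whose baseline WER differs widely, which is the situation a tiered benchmark creates.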

📝 Abstract

Fine-tuning multilingual ASR models like Whisper for low-resource languages often improves read speech but degrades spontaneous audio performance, a phenomenon we term studio-bias. To diagnose this mismatch, we introduce Vividh-ASR, a complexity-stratified benchmark for Hindi and Malayalam across four tiers: studio, broadcast, spontaneous, and synthetic noise. Through a controlled study of learning-rate timing and curriculum ordering, we find that early large parameter updates improve global WER by 12 absolute points, while a hard-to-easy curriculum adds gains for spontaneous speech. These findings motivate reverse multi-stage fine-tuning (R-MFT), a training recipe that enables a parameter-efficient 244M Whisper model to match or exceed conventionally fine-tuned 769M counterparts. Representational analysis via CKA and SVD reveals that effective schedules concentrate adaptation in the decoder, preserving the pre-trained encoder's acoustic geometry. We release the benchmark and models.
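For readers unfamiliar with CKA (centered kernel alignment), one common form is the linear variant below. The abstract does not say which variant the authors use, so this is an assumption for illustration. Here X and Y are activation matrices from the same n inputs (for example, one encoder layer before and after fine-tuning), each with its columns mean-centered.

```latex
\mathrm{CKA}(X, Y) =
  \frac{\lVert Y^{\top} X \rVert_F^{2}}
       {\lVert X^{\top} X \rVert_F \, \lVert Y^{\top} Y \rVert_F},
\qquad X \in \mathbb{R}^{n \times d_1},\; Y \in \mathbb{R}^{n \times d_2}
```

The score lies between 0 and 1 and is invariant to orthogonal rotation and isotropic scaling of either representation. Under this reading, encoder layers that score close to 1 against their pre-trained counterparts have kept their representational geometry, while lower scores in decoder layers would indicate that adaptation is concentrated there.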

Problem

Research questions and friction points this paper is trying to address.

studio-bias

low-resource languages

spontaneous speech

automatic speech recognition

multilingual ASR

Innovation

Methods, ideas, or system contributions that make the work stand out.

complexity-tiered benchmark

reverse multi-stage fine-tuning

studio-bias