🤖 AI Summary
This work addresses the lack of high-quality public benchmarks for long-form automatic speech recognition (ASR) and speaker diarization in Bengali. The authors introduce Bengali-Loop, two community benchmarks built from realistic multi-speaker, long-duration content: an ASR corpus of 191 recordings (158.6 hours, about 792,000 words) with human-verified transcripts, collected through a subtitle-extraction pipeline followed by manual verification, and a speaker diarization corpus of 24 recordings (22 hours, 5,744 annotated segments) with fully manual speaker-turn labels. Both come with standardized evaluation protocols, annotation rules, and data formats. Two baselines are reported: Tugstugi reaches a word error rate (WER) of 34.07% on the ASR corpus, and pyannote.audio reaches a diarization error rate (DER) of 40.08% on the diarization corpus. Together, the corpora and baselines fill a resource gap for long-form spoken language processing in Bengali.
📝 Abstract
Bengali (Bangla) remains under-resourced in long-form speech technology despite its wide use. We present Bengali-Loop, two community benchmarks to address this gap: (1) a long-form ASR corpus of 191 recordings (158.6 hours, 792k words) from 11 YouTube channels, collected via a reproducible subtitle-extraction pipeline and human-in-the-loop transcript verification; and (2) a speaker diarization corpus of 24 recordings (22 hours, 5,744 annotated segments) with fully manual speaker-turn labels in CSV format. Both benchmarks target realistic multi-speaker, long-duration content (e.g., Bangla drama/natok). We establish baselines (Tugstugi: 34.07% WER; pyannote.audio: 40.08% DER) and provide standardized evaluation protocols (WER/CER, DER), annotation rules, and data formats to support reproducible benchmarking and future model development for Bangla long-form ASR and diarization.
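The WER and CER metrics named in the abstract are both edit-distance ratios, which the following minimal sketch illustrates. It is not the benchmark's official scoring script: the function names are hypothetical, no text normalization is applied, and CER is computed here over Unicode code points with whitespace removed, which is an assumption (the paper's protocol may tokenize Bangla text differently, for example by grapheme cluster).

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion of a reference token
                            curr[j - 1] + 1,      # insertion of a hypothesis token
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance divided by reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate over code points, ignoring whitespace (an assumption)."""
    ref_chars = [c for c in reference if not c.isspace()]
    if not ref_chars:
        raise ValueError("reference must contain at least one character")
    hyp_chars = [c for c in hypothesis if not c.isspace()]
    return edit_distance(ref_chars, hyp_chars) / len(ref_chars)


if __name__ == "__main__":
    reference = "আমি ভাত খাই"   # three reference words
    hypothesis = "আমি খাই"      # one word dropped by the recognizer
    print(f"WER = {wer(reference, hypothesis):.2%}")
    print(f"CER = {cer(reference, hypothesis):.2%}")
```

Because the denominator is the reference length, both rates can exceed 100% when the hypothesis contains many insertions. For long-form recordings, how the transcript is segmented and aligned before scoring also affects the result, so reported numbers are only comparable under the same protocol.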