Bengali-Loop: Community Benchmarks for Long-Form Bangla ASR and Speaker Diarization

📅 2026-02-15

🤖 AI Summary

This work addresses the lack of high-quality public benchmarks for long-form automatic speech recognition (ASR) and speaker diarization in Bengali. The authors introduce two community benchmarks built from realistic multi-speaker, long-duration content: an ASR corpus of 191 recordings (158.6 hours, about 792,000 words) whose transcripts were collected through a subtitle-extraction pipeline and then verified by humans, and a speaker diarization corpus of 24 recordings (22 hours, 5,744 annotated segments) with fully manual speaker-turn annotations. Both come with standardized evaluation protocols and data formats. Baselines are reported for each task: Tugstugi reaches a word error rate (WER) of 34.07% on ASR, and pyannote.audio reaches a diarization error rate (DER) of 40.08%.
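To make the WER figure concrete, here is a minimal sketch of how word and character error rates are conventionally computed from an edit distance. It is an illustration and not the benchmark's scoring script: the function names are hypothetical, no text normalization is applied, and CER here counts Unicode code points with whitespace removed, which is an assumption (a real Bangla evaluation may normalize text or count grapheme clusters instead).

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edits divided by the number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate over code points, ignoring whitespace (an assumption)."""
    ref_chars = list("".join(reference.split()))
    if not ref_chars:
        raise ValueError("reference must contain at least one character")
    hyp_chars = list("".join(hypothesis.split()))
    return edit_distance(ref_chars, hyp_chars) / len(ref_chars)


if __name__ == "__main__":
    # One inserted word against a three-word reference.
    print(f"{wer('আমি ভাত খাই', 'আমি ভাত খাই না'):.2%}")
```

Because the denominator is the reference length, WER can exceed 100% when the hypothesis contains many insertions.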

📝 Abstract

Bengali (Bangla) remains under-resourced in long-form speech technology despite its wide use. We present Bengali-Loop, two community benchmarks to address this gap: (1) a long-form ASR corpus of 191 recordings (158.6 hours, 792k words) from 11 YouTube channels, collected via a reproducible subtitle-extraction pipeline and human-in-the-loop transcript verification; and (2) a speaker diarization corpus of 24 recordings (22 hours, 5,744 annotated segments) with fully manual speaker-turn labels in CSV format. Both benchmarks target realistic multi-speaker, long-duration content (e.g., Bangla drama/natok). We establish baselines (Tugstugi: 34.07% WER; pyannote.audio: 40.08% DER) and provide standardized evaluation protocols (WER/CER, DER), annotation rules, and data formats to support reproducible benchmarking and future model development for Bangla long-form ASR and diarization.
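As an illustration of the diarization metric, the sketch below reads speaker-turn labels from CSV and computes a simplified DER: missed speech, false alarm, and speaker confusion, divided by total reference speech time. Several things here are assumptions and not taken from the benchmark: the column names `start`, `end`, and `speaker` are hypothetical, no forgiveness collar is applied, and the speaker mapping is found by brute force, which only scales to a handful of speakers.

```python
import csv
import io
from itertools import permutations


def read_segments(csv_text):
    """Parse rows into (start, end, speaker); the column names are hypothetical."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [(float(r["start"]), float(r["end"]), r["speaker"]) for r in reader]


def active(segments, t0, t1):
    """Speakers with a segment covering the whole interval [t0, t1)."""
    return {spk for s, e, spk in segments if s <= t0 and e >= t1}


def der(reference, hypothesis):
    """Simplified diarization error rate with no collar (illustrative only)."""
    # Cut the timeline at every segment boundary so speaker sets are constant per interval.
    bounds = sorted({t for s, e, _ in reference + hypothesis for t in (s, e)})
    intervals = [(t0, t1, active(reference, t0, t1), active(hypothesis, t0, t1))
                 for t0, t1 in zip(bounds, bounds[1:])]
    total = sum((t1 - t0) * len(ref) for t0, t1, ref, _ in intervals)
    if total == 0:
        raise ValueError("reference contains no speech")

    ref_speakers = sorted({spk for _, _, spk in reference})
    hyp_speakers = sorted({spk for _, _, spk in hypothesis})
    # Each hypothesis speaker maps to a distinct reference speaker or to None.
    candidates = ref_speakers + [None] * len(hyp_speakers)

    best_error = None
    for perm in permutations(candidates, len(hyp_speakers)):
        mapping = dict(zip(hyp_speakers, perm))
        error = 0.0
        for t0, t1, ref, hyp in intervals:
            correct = len(ref & {mapping[h] for h in hyp})
            missed = max(0, len(ref) - len(hyp))
            false_alarm = max(0, len(hyp) - len(ref))
            confusion = min(len(ref), len(hyp)) - correct
            error += (t1 - t0) * (missed + false_alarm + confusion)
        if best_error is None or error < best_error:
            best_error = error
    return best_error / total


if __name__ == "__main__":
    ref = read_segments("start,end,speaker\n0,10,A\n10,20,B\n")
    hyp = read_segments("start,end,speaker\n0,12,X\n12,18,Y\n")
    print(f"{der(ref, hyp):.2%}")
```

In the toy example, the system holds the first speaker two seconds too long (confusion) and ends the second speaker two seconds early (missed speech), so 4 of the 20 reference seconds are scored as errors. Established toolkits solve the speaker mapping with an assignment algorithm instead of enumerating permutations.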

Problem

Research questions and friction points this paper is trying to address.

Bengali

long-form speech

ASR

speaker diarization

under-resourced language

Innovation

Methods, ideas, or system contributions that make the work stand out.

long-form ASR

speaker diarization

Bengali speech benchmarks

human-in-the-loop verification

reproducible evaluation
