Bengali-Loop: Community Benchmarks for Long-Form Bangla ASR and Speaker Diarization

📅 2026-02-15
🤖 AI Summary
This work addresses the lack of high-quality public benchmarks for long-form automatic speech recognition (ASR) and speaker diarization in Bengali. The authors introduce two community benchmarks targeting realistic multi-speaker, long-duration content: a 158.6-hour ASR corpus of 191 recordings (792k words) with human-verified transcripts, collected through a subtitle-extraction pipeline followed by manual verification, and a 22-hour speaker diarization corpus of 24 recordings (5,744 annotated segments) with fully manual annotations. Both come with standardized evaluation protocols and data formats. A baseline is reported for each task: Tugstugi reaches a word error rate (WER) of 34.07% on ASR, and pyannote.audio reaches a diarization error rate (DER) of 40.08%.

📝 Abstract
Bengali (Bangla) remains under-resourced in long-form speech technology despite its wide use. We present Bengali-Loop, two community benchmarks to address this gap: (1) a long-form ASR corpus of 191 recordings (158.6 hours, 792k words) from 11 YouTube channels, collected via a reproducible subtitle-extraction pipeline and human-in-the-loop transcript verification; and (2) a speaker diarization corpus of 24 recordings (22 hours, 5,744 annotated segments) with fully manual speaker-turn labels in CSV format. Both benchmarks target realistic multi-speaker, long-duration content (e.g., Bangla drama/natok). We establish baselines (Tugstugi: 34.07% WER; pyannote.audio: 40.08% DER) and provide standardized evaluation protocols (WER/CER, DER), annotation rules, and data formats to support reproducible benchmarking and future model development for Bangla long-form ASR and diarization.
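As an illustration of the WER/CER metrics named in the abstract, the following is a minimal sketch of how such scores are conventionally computed from a Levenshtein alignment. It is not the benchmark's official scoring script. The function names (`edit_distance`, `wer`, `cer`), the whitespace tokenization, and the choice to count Unicode code points as characters are all assumptions made here; the paper's own normalization and annotation rules should take precedence.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences.

    Counts the minimum number of substitutions, insertions, and
    deletions needed to turn `ref` into `hyp`.
    """
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edits divided by reference word count.

    Assumption: words are separated by whitespace, with no further
    text normalization.
    """
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate over Unicode code points.

    Assumption: whitespace is collapsed to single spaces and each code
    point (including Bangla combining signs) counts as one character.
    """
    ref_chars = list(" ".join(reference.split()))
    if not ref_chars:
        raise ValueError("reference must contain at least one character")
    hyp_chars = list(" ".join(hypothesis.split()))
    return edit_distance(ref_chars, hyp_chars) / len(ref_chars)
```

For example, `wer("আমি ভাত খাই", "আমি রুটি খাই")` counts one substituted word out of three reference words. For long-form recordings, scores are usually obtained by summing edits and reference lengths across a whole recording or corpus rather than by averaging per-utterance rates; which of these the benchmark uses is not stated in this excerpt.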
Problem

Research questions and friction points this paper is trying to address.

Bengali
long-form speech
ASR
speaker diarization
under-resourced language
Innovation

Methods, ideas, or system contributions that make the work stand out.

long-form ASR
speaker diarization
Bengali speech benchmarks
human-in-the-loop verification
reproducible evaluation
H. M. Shadman Tabib
Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology
Istiak Ahmmed Rifti
Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology
Abdullah Muhammed Amimul Ehsan
Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology
Somik Dasgupta
Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology
Md Zim Mim Siddiqee Sowdha
Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology
Abrar Jahin Sarker
Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology
Md. Rafiul Islam Nijamy
Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology
Tanvir Hossain
Georgia State University
Machine Learning, Data Mining, Natural Language Processing
Mst. Metaly Khatun
Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology
Munzer Mahmood
Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology
Rakesh Debnath
Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology
Gourab Biswas
Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology
Asif Karim
Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology
Wahid Al Azad Navid
Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology
Masnoon Muztahid
Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology
Fuad Ahmed Udoy
Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology
Shahad Shahriar Rahman
Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology
Md. Tashdiqur Rahman Shifat
Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology
Most. Sonia Khatun
Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology
Mushfiqur Rahman
Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology
Md. Miraj Hasan
Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology
Anik Saha
Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology
Mohammad Ninad Mahmud Nobo
Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology
Soumik Bhattacharjee
Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology
Tusher Bhomik
Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology