Bangla-WhisperDiar: Fine-Tuning Whisper and PyAnnote for Bangla Long-Form Speech Recognition and Speaker Diarization

📅 2026-05-06

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses automatic speech recognition (ASR) and speaker diarization for long-form Bangla speech recorded in varied acoustic conditions with multiple speakers. The authors build two separately fine-tuned systems rather than a single jointly optimized model. For ASR, they fully fine-tune a Bangla Whisper Medium checkpoint on roughly 15,000 chunked and aligned audio segments, with extensive audio augmentation. For diarization, they fine-tune the pyannote/segmentation-3.0 model and swap it into the pyannote/speaker-diarization-community-1 pipeline, keeping the pretrained speaker embedding and clustering components. On the test set, the systems reach a word error rate (WER) of 24.41% and a diarization error rate (DER) of 23.92%, which the authors report as notable improvements over the respective pretrained baselines.
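The WER figure quoted above is conventionally the word-level edit distance (substitutions, deletions, and insertions) divided by the number of reference words. The following is a minimal sketch of that computation, assuming plain whitespace tokenization; the paper's own text normalization is not reproduced here, and the function name `word_error_rate` is illustrative rather than taken from the authors' code.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution and one deletion against four reference words.
    print(word_error_rate("a b c d", "a x c"))
```

Under this definition a score of 0.2441 corresponds to roughly one word-level error for every four reference words. Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference.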

📝 Abstract

Automatic Speech Recognition (ASR) and speaker diarization in Bangla remain challenging due to long-form recordings, diverse acoustic conditions, and significant speaker variability. This work addresses these two core tasks in Bangla spoken language understanding by developing robust systems for long-form ASR and speaker diarization. For ASR (Problem 1), we fine-tune the tugstugi bengaliai regional asr whisper medium model on a custom-curated dataset of approximately 15,000 chunked and aligned Bangla audio segments, employing full-weight training with extensive data augmentation, including noise injection, reverb simulation, echo, clipping distortion, and pitch/time perturbation. For speaker diarization (Problem 2), we fine-tune the pyannote/segmentation-3.0 model using PyTorch Lightning on the competition-annotated diarization dataset, then swap the fine-tuned segmentation backbone into the pyannote/speaker-diarization-community-1 pipeline while retaining the pretrained speaker embedding and clustering components. Our ASR system achieves a Word Error Rate (WER) of 0.2441 and our diarization system achieves a Diarization Error Rate (DER) of 0.2392, both evaluated on the test set, demonstrating notable improvements over the respective pretrained baselines. We describe our complete pipeline, including data preprocessing, text normalization, audio augmentation, training strategies, inference optimization, and post-processing for both tasks.
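To make three of the augmentations named in the abstract concrete (noise injection, echo, and clipping distortion), here is a minimal pure-Python sketch operating on a list of float samples. It is not the authors' implementation: the abstract does not state which augmentation library or parameter values were used, so the function names, the Gaussian noise model, the single-tap echo, and the example SNR, delay, and threshold values are all assumptions chosen for illustration.

```python
import math
import random


def add_noise(samples, snr_db, rng):
    """Add Gaussian noise scaled to a target signal-to-noise ratio in dB."""
    if not samples:
        return []
    signal_power = sum(s * s for s in samples) / len(samples)
    # SNR(dB) = 10 * log10(signal_power / noise_power)
    noise_power = signal_power / (10 ** (snr_db / 10))
    sigma = math.sqrt(noise_power)
    return [s + rng.gauss(0.0, sigma) for s in samples]


def add_echo(samples, delay, decay):
    """Mix in one delayed, attenuated copy of the signal (single-tap echo)."""
    out = list(samples)
    for i in range(delay, len(samples)):
        out[i] += decay * samples[i - delay]
    return out


def clip(samples, threshold):
    """Hard-clip every sample to the range [-threshold, threshold]."""
    return [max(-threshold, min(threshold, s)) for s in samples]


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the run is repeatable
    # 100 ms of a 440 Hz tone at 16 kHz, standing in for a speech chunk.
    tone = [math.sin(2 * math.pi * 440 * n / 16000) for n in range(1600)]
    augmented = clip(add_echo(add_noise(tone, 20, rng), 800, 0.4), 0.9)
    print(len(augmented), max(augmented), min(augmented))
```

In a training setup, transforms like these are usually applied at random with some probability per example, so the model sees both clean and degraded versions of the same segments. Reverb simulation and pitch/time perturbation need convolution or resampling and are left out of this sketch.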

Problem

Research questions and friction points this paper is trying to address.

Automatic Speech Recognition

Speaker Diarization

Bangla

Long-Form Speech

Acoustic Variability

Innovation

Methods, ideas, or system contributions that make the work stand out.

Fine-Tuning

Audio Augmentation

Speaker Diarization