🤖 AI Summary
This work addresses robust speaker diarization (SD) in multilingual and code-switched audio, where low-resource conditions severely degrade SD performance. We propose a language-agnostic, end-to-end joint SD–ASR–NMT pipeline. To make SD more robust, we use our recently proposed multi-kernel consensus spectral clustering framework and combine it with lightweight voice activity detection (VAD), fine-tuned ECAPA-TDNN speaker embeddings, multilingual automatic speech recognition (ASR; Whisper/XLS-R), and neural machine translation (NMT), augmented by language identification and rule-based post-processing. To our knowledge, this is the first work to empirically validate the engineering feasibility of full-chain co-optimization of SD–ASR–NMT on real-world multilingual mixed audio. Evaluated on the NCIIPC challenge training set, the system reduces diarization error rate (DER) by 32% over baseline methods, supports Hindi, Tamil, English, and their code-switched combinations, runs end to end with a real-time factor below 1.8, and improves cross-lingual generalization and system robustness.
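For readers unfamiliar with the two metrics quoted in the summary, the sketch below shows how DER and real-time factor are conventionally computed. The function names and the example numbers are hypothetical, not taken from the report. The summary does not say whether the 32% figure is a relative or an absolute reduction; the helper below shows the relative reading as an assumption.

```python
def diarization_error_rate(missed, false_alarm, confusion, total_speech):
    """Standard DER: error time divided by total reference speech time.

    All arguments are durations in seconds.
    """
    if total_speech <= 0:
        raise ValueError("total_speech must be positive")
    return (missed + false_alarm + confusion) / total_speech


def relative_reduction(baseline_der, system_der):
    """Relative DER reduction (assumed reading of 'reduces DER by 32%')."""
    if baseline_der <= 0:
        raise ValueError("baseline_der must be positive")
    return (baseline_der - system_der) / baseline_der


def real_time_factor(processing_seconds, audio_seconds):
    """Processing time per second of audio; values below 1 are faster than real time."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


# Hypothetical numbers, for illustration only.
baseline = diarization_error_rate(missed=6.0, false_alarm=4.0, confusion=15.0, total_speech=100.0)
system = diarization_error_rate(missed=5.0, false_alarm=3.0, confusion=9.0, total_speech=100.0)
print(f"baseline DER={baseline:.2f}, system DER={system:.2f}, "
      f"relative reduction={relative_reduction(baseline, system):.2f}")
```

Under this reading, a real-time factor below 1.8 means that one minute of audio takes less than 108 seconds to process end to end.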
📝 Abstract
In this report, we summarize the integrated multilingual audio processing pipeline developed by our team for the inaugural NCIIPC Startup India AI GRAND CHALLENGE, addressing Problem Statement 06: Language-Agnostic Speaker Identification and Diarisation, and subsequent Transcription and Translation System. Our primary focus was on advancing speaker diarization (SD), a critical component for multilingual and code-mixed scenarios, and our main goal was to study the real-world applicability of our in-house SD systems. To this end, we investigated a robust voice activity detection (VAD) technique and fine-tuned speaker embedding models for improved speaker identification in low-resource settings. We leveraged our recently proposed multi-kernel consensus spectral clustering framework, which substantially improved diarization performance across all recordings in the training corpus provided by the organizers. Complementary modules for speaker and language identification, automatic speech recognition (ASR), and neural machine translation were integrated into the pipeline. Post-processing refinements further improved system robustness.
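The abstract does not describe how the multi-kernel consensus spectral clustering framework works, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes that several Gaussian kernels with different bandwidths are built on cosine distances between segment embeddings, that the "consensus" is their unweighted mean, and that the number of speakers is known. The function names, bandwidth values, and the small k-means routine are all hypothetical choices made for illustration.

```python
import numpy as np


def rbf_affinities(emb, bandwidths):
    """Return one Gaussian affinity matrix per bandwidth, using cosine distance."""
    unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    cos_dist = 1.0 - np.clip(unit @ unit.T, -1.0, 1.0)
    return [np.exp(-(cos_dist ** 2) / (2.0 * s ** 2)) for s in bandwidths]


def consensus_affinity(affinities):
    """Assumed consensus rule: unweighted mean of the per-kernel affinities."""
    return np.mean(affinities, axis=0)


def spectral_embed(affinity, n_speakers):
    """Row-normalized top eigenvectors of the normalized affinity D^-1/2 A D^-1/2."""
    d_inv_sqrt = 1.0 / np.sqrt(affinity.sum(axis=1))
    norm_aff = affinity * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    _, vecs = np.linalg.eigh(norm_aff)  # eigenvalues come back in ascending order
    top = vecs[:, -n_speakers:]
    return top / np.linalg.norm(top, axis=1, keepdims=True)


def kmeans(points, k, n_iter=50):
    """Small deterministic k-means with farthest-point initialization."""
    centers = [points[0]]
    for _ in range(1, k):
        d = np.min([np.sum((points - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(points[np.argmax(d)])
    centers = np.array(centers)
    for _ in range(n_iter):
        dists = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        new_centers = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels


def cluster_segments(emb, n_speakers, bandwidths=(0.3, 0.6, 1.0)):
    """Assign a speaker label to each segment embedding (rows of emb)."""
    affinity = consensus_affinity(rbf_affinities(emb, bandwidths))
    return kmeans(spectral_embed(affinity, n_speakers), n_speakers)
```

In a pipeline like the one described, `emb` would hold one speaker embedding per VAD-detected speech segment, and the returned labels would be mapped back to segment timestamps. A real system would also need to estimate the number of speakers, for example from the eigenvalue gap, which this sketch omits.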