The TCG CREST -- RKMVERI Submission for the NCIIPC Startup India AI Grand Challenge

📅 2025-12-11

🤖 AI Summary

This work addresses robust speaker diarization (SD) in multilingual and code-switched audio, where low-resource conditions severely degrade SD performance. We propose a language-agnostic, end-to-end joint SD–ASR–NMT pipeline. To make SD more robust, the pipeline combines lightweight voice activity detection (VAD), fine-tuned ECAPA-TDNN speaker embeddings, and a multi-kernel consensus spectral clustering framework with multilingual ASR (Whisper/XLS-R) and neural machine translation (NMT), augmented by language identification and rule-based post-processing. To our knowledge, this is the first work to empirically validate the engineering feasibility of full-chain co-optimization of SD–ASR–NMT on real-world multilingual mixed audio. Evaluated on the NCIIPC challenge training set, our system reduces diarization error rate (DER) by 32% over baseline methods, supports Hindi, Tamil, English, and their code-switched combinations, achieves an end-to-end real-time factor below 1.8, and significantly improves cross-lingual generalization and system robustness.
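For readers unfamiliar with the metric quoted above, DER is the sum of missed speech, false-alarm speech, and speaker-confusion time, divided by the total reference speech time. The sketch below is a minimal frame-level illustration of that definition, not the challenge's scoring tool: the function name `frame_der`, the use of `None` for non-speech frames, and the brute-force speaker mapping are all assumptions made for the example, and it ignores overlapped speech and forgiveness collars.

```python
from itertools import permutations


def frame_der(ref, hyp):
    """Frame-level DER. Labels are hashable speaker ids; None marks non-speech.

    Hypothetical helper for illustration: one label per frame, no overlap,
    no collar. Speaker mapping is found by brute force, so it only suits
    recordings with a handful of speakers.
    """
    if len(ref) != len(hyp):
        raise ValueError("ref and hyp must have the same number of frames")
    total_speech = sum(r is not None for r in ref)
    if total_speech == 0:
        raise ValueError("reference contains no speech")

    missed = sum(r is not None and h is None for r, h in zip(ref, hyp))
    false_alarm = sum(r is None and h is not None for r, h in zip(ref, hyp))

    # Frames where both reference and hypothesis contain speech.
    both = [(r, h) for r, h in zip(ref, hyp) if r is not None and h is not None]
    ref_spk = sorted({r for r, _ in both}, key=repr)
    hyp_spk = sorted({h for _, h in both}, key=repr)

    # Each hypothesis speaker maps to one distinct reference speaker, or to a
    # dummy target when there are more hypothesis than reference speakers.
    targets = ref_spk + [("unmatched", i) for i in range(len(hyp_spk))]
    confusion = len(both)
    for perm in permutations(targets, len(hyp_spk)):
        mapping = dict(zip(hyp_spk, perm))
        errors = sum(mapping[h] != r for r, h in both)
        confusion = min(confusion, errors)

    return (missed + false_alarm + confusion) / total_speech


# Toy example: 8 frames, two reference speakers, silence at the end.
reference = [0, 0, 0, 1, 1, 1, None, None]
hypothesis = ["b", "b", "a", "a", "a", None, None, "b"]
der = frame_der(reference, hypothesis)
```

Because hypothesis labels are arbitrary, the error is computed under the best one-to-one mapping between hypothesis and reference speakers, which is why `"a"`/`"b"` can be compared against `0`/`1`.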

📝 Abstract

In this report, we summarize the integrated multilingual audio processing pipeline developed by our team for the inaugural NCIIPC Startup India AI GRAND CHALLENGE, addressing Problem Statement 06: Language-Agnostic Speaker Identification and Diarisation, and subsequent Transcription and Translation System. Our primary focus was on advancing speaker diarization, a critical component for multilingual and code-mixed scenarios. The main intent of this work was to study the real-world applicability of our in-house speaker diarization (SD) systems. To this end, we investigated a robust voice activity detection (VAD) technique and fine-tuned speaker embedding models for improved speaker identification in low-resource settings. We leveraged our own recently proposed multi-kernel consensus spectral clustering framework, which substantially improved the diarization performance across all recordings in the training corpus provided by the organizers. Complementary modules for speaker and language identification, automatic speech recognition (ASR), and neural machine translation were integrated in the pipeline. Post-processing refinements further improved system robustness.
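The abstract does not spell out how the multi-kernel consensus step works, so the following is only a rough sketch of the general idea of clustering speaker embeddings with several affinity kernels at once. It assumes an equal-weight average of one cosine kernel and a few RBF kernels as a stand-in for the consensus step, followed by standard normalized spectral clustering; the function names, kernel choices, and bandwidths are illustrative and are not the authors' method.

```python
import numpy as np


def cosine_kernel(X):
    """Cosine similarity rescaled from [-1, 1] to [0, 1]."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    return np.clip((Xn @ Xn.T + 1.0) / 2.0, 0.0, 1.0)


def rbf_kernel(X, gamma):
    """Gaussian kernel exp(-gamma * squared Euclidean distance)."""
    sq = np.sum(X ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * (X @ X.T), 0.0)
    return np.exp(-gamma * d2)


def consensus_affinity(X, gammas=(0.5, 1.0, 2.0)):
    """ASSUMPTION: consensus = plain average of the individual kernels."""
    kernels = [cosine_kernel(X)] + [rbf_kernel(X, g) for g in gammas]
    return np.mean(kernels, axis=0)


def spectral_embed(A, n_clusters):
    """Top eigenvectors of D^{-1/2} A D^{-1/2}, with rows normalized."""
    d_inv_sqrt = 1.0 / np.sqrt(A.sum(axis=1))
    M = d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]
    _, vecs = np.linalg.eigh(M)      # eigenvalues in ascending order
    top = vecs[:, -n_clusters:]      # keep the largest n_clusters
    norms = np.maximum(np.linalg.norm(top, axis=1, keepdims=True), 1e-12)
    return top / norms


def kmeans(Z, k, n_iter=50, seed=0):
    """Small k-means with farthest-point initialization."""
    rng = np.random.default_rng(seed)
    centers = [Z[rng.integers(len(Z))]]
    for _ in range(1, k):
        dist = np.min([np.sum((Z - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(Z[np.argmax(dist)])
    centers = np.array(centers)
    for _ in range(n_iter):
        d2 = ((Z[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = np.argmin(d2, axis=1)
        new_centers = np.array([
            Z[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels


def cluster_speakers(embeddings, n_speakers):
    """Assign a speaker label to each segment embedding."""
    A = consensus_affinity(embeddings)
    return kmeans(spectral_embed(A, n_speakers), n_speakers)


# Synthetic stand-in for segment-level speaker embeddings (two speakers).
rng = np.random.default_rng(42)
spk_a = np.eye(8)[0] * 3.0 + 0.1 * rng.standard_normal((20, 8))
spk_b = np.eye(8)[1] * 3.0 + 0.1 * rng.standard_normal((20, 8))
labels = cluster_speakers(np.vstack([spk_a, spk_b]), n_speakers=2)
```

In a real system the rows of `embeddings` would come from a speaker embedding model applied to VAD-detected speech segments, and the number of speakers would usually have to be estimated (for example from the eigenvalue gap) rather than supplied.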

Problem

Research questions and friction points this paper is trying to address.

Building a multilingual audio pipeline for speaker identification and diarization

Improving speaker diarization in low-resource and code-mixed scenarios

Integrating complementary modules, including speech recognition and translation

Innovation

Methods, ideas, or system contributions that make the work stand out.

Fine-tuned speaker embedding models for low-resource settings

Multi-kernel consensus spectral clustering for diarization

Integrated multilingual pipeline with ASR and translation
