VoxVietnam: a Large-Scale Multi-Genre Dataset for Vietnamese Speaker Recognition

📅 2024-12-31
🤖 AI Summary
Problem: The absence of large-scale, multi-genre (e.g., news, interviews, podcasts, telephone) benchmark datasets hinders robust speaker recognition for Vietnamese, a low-resource language. Method: We introduce VoxVietnam, the first open-source, multi-genre Vietnamese speaker recognition dataset, comprising over 187,000 utterances from 1,406 speakers. We propose a fully automated, reproducible end-to-end pipeline integrating web crawling, ASR-based transcription, speaker diarization via spectral clustering, and rigorous quality filtering. Contribution/Results: We systematically model and empirically validate the substantial impact of genre variation on speaker verification performance: models trained on a single genre suffer a 12.6% increase in equal error rate (EER) when evaluated cross-genre. Training with VoxVietnam reduces cross-genre EER by 38.2%, significantly improving generalization. This work establishes a data foundation and a methodological reference for robust speaker recognition in low-resource languages.
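Since the results above are reported as equal error rate (EER), a minimal sketch of how EER can be computed from verification trial scores may help. This is an illustration, not the paper's evaluation code: the function names `equal_error_rate` and `relative_change`, the threshold sweep over observed scores, and the toy scores are all assumptions. The summary also does not say whether its percentages are relative or absolute changes; the helper below shows the relative reading only.

```python
import numpy as np


def equal_error_rate(scores, labels):
    """Return (eer, threshold) for a set of verification trials.

    scores: similarity score per trial (higher means "same speaker").
    labels: 1 for target trials (same speaker), 0 for non-target trials.
    A trial is accepted when its score is >= the threshold.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    target = scores[labels == 1]
    nontarget = scores[labels == 0]

    # Sweep every observed score as a candidate threshold.
    thresholds = np.unique(scores)
    # False acceptance rate: non-target trials that are accepted.
    far = np.array([np.mean(nontarget >= t) for t in thresholds])
    # False rejection rate: target trials that are rejected.
    frr = np.array([np.mean(target < t) for t in thresholds])

    # EER is taken where the two error rates are closest.
    idx = int(np.argmin(np.abs(far - frr)))
    return (far[idx] + frr[idx]) / 2.0, thresholds[idx]


def relative_change(before, after):
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


# Toy trials (hypothetical scores, not taken from the paper).
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
toy_labels = [1, 1, 1, 0, 1, 0, 0, 0]
toy_eer, toy_threshold = equal_error_rate(toy_scores, toy_labels)
```

On the toy trials, one target scores below one non-target, so the false acceptance and false rejection rates meet at 25%. Production toolkits usually interpolate between thresholds rather than picking the nearest one, so values can differ slightly on real trial lists.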

📝 Abstract
Recent research in speaker recognition aims to address vulnerabilities caused by variations between enrolment and test utterances, particularly the multi-genre phenomenon, in which the utterances come from different speech genres. Previous resources for Vietnamese speaker recognition are either limited in size or do not focus on genre diversity, leaving multi-genre effects unexplored. This paper introduces VoxVietnam, the first multi-genre dataset for Vietnamese speaker recognition, with over 187,000 utterances from 1,406 speakers, together with an automated pipeline for constructing such a dataset at large scale from public sources. Our experiments show the challenges that the multi-genre phenomenon poses to models trained on a single-genre dataset, and demonstrate a significant performance gain when VoxVietnam is incorporated for multi-genre training.
Problem

Research questions and friction points this paper is trying to address.

Vietnamese speaker recognition
Diverse speech types
Performance evaluation
Innovation

Methods, ideas, or system contributions that make the work stand out.

VoxVietnam
Multi-genre Vietnamese Speaker Recognition
Cross-genre Recognition Performance
Hoang Long Vu
Hanoi University of Science and Technology, Hanoi, Vietnam
P. Dat
Hanoi University of Science and Technology, Hanoi, Vietnam
P. Nhi
Hanoi University of Science and Technology, Hanoi, Vietnam
Nguyen Song Hao
Hanoi University of Science and Technology, Hanoi, Vietnam
Nguyen Thi Thu Trang
Lecturer & Researcher, School of Information and Communication Technology, Hanoi University of
Speech Synthesis, Speaker Recognition, Speech Technology, Natural Language Processing