🤖 AI Summary
Spoken language recognition systems built on traditional feature-vector models struggle to capture language characteristics that unfold over extended time spans. Method: The authors propose Funnel-TDNN, a funnel-shaped Time Delay Neural Network (TDNN) that adds layers to a TDNN backbone and uses a specialized pooling layer to capture language characteristics over long durations. The model is trained on Common Voice data, including a phase with augmented data, and its settings are chosen by grid search. Results: On ten languages spanning the Indo-European, Semitic, and East Asian families, the system achieves 97% accuracy in language recognition. The authors present this as evidence that long-range temporal modeling matters for recognizing languages from audio.
📝 Abstract
In this research, we advanced a spoken language recognition system beyond traditional feature vector-based models. Our improvements focused on capturing language characteristics over extended periods using a specialized pooling layer. We used a broad range of data from Common Voice, targeting ten languages across the Indo-European, Semitic, and East Asian families. The major innovation was optimizing the architecture of Time Delay Neural Networks (TDNNs): we introduced additional layers and restructured the networks into a funnel shape, enhancing their ability to process complex linguistic patterns. A rigorous grid search determined the optimal settings for these networks, significantly improving their ability to recognize language patterns in audio samples. The model underwent extensive training, including a phase with augmented data, to refine its capabilities. The result is a highly accurate system that achieves 97% accuracy in language recognition. This advance contributes to artificial intelligence by improving the accuracy and efficiency of language processing systems, a critical aspect of engineering advanced speech recognition technologies.
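The abstract does not say which pooling operation, layer widths, or search grid were used, so the following is only a minimal sketch under stated assumptions. It assumes statistics pooling (per-dimension mean and standard deviation over time), a common choice in TDNN-based systems, and a funnel whose layer widths shrink linearly. The names `stats_pool`, `funnel_widths`, and `grid`, and all numeric values, are hypothetical and are not taken from the paper.

```python
import math
from itertools import product


def stats_pool(frames):
    """Collapse a variable-length list of frame vectors into one fixed-size
    vector: per-dimension mean followed by per-dimension standard deviation.
    Assumption: the paper's "specialized pooling layer" may differ."""
    if not frames:
        raise ValueError("need at least one frame")
    n = len(frames)
    dim = len(frames[0])
    means = [sum(f[d] for f in frames) / n for d in range(dim)]
    stds = [
        math.sqrt(sum((f[d] - means[d]) ** 2 for f in frames) / n)
        for d in range(dim)
    ]
    return means + stds


def funnel_widths(first, last, depth):
    """Hypothetical funnel shape: widths shrink linearly from first to last."""
    if depth < 2:
        raise ValueError("depth must be at least 2")
    step = (first - last) / (depth - 1)
    return [round(first - i * step) for i in range(depth)]


def grid(first_widths, depths, last=128):
    """Enumerate candidate funnel configurations for a grid search."""
    return [funnel_widths(f, last, d) for f, d in product(first_widths, depths)]
```

Because the pooled vector has a fixed length (twice the frame dimension) regardless of how many frames the utterance contains, a classifier over the ten languages can sit directly on top of it. In a real grid search, each configuration returned by `grid` would be trained and scored on held-out data.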