LUPET: Incorporating Hierarchical Information Path into Multilingual ASR

📅 2024-01-08
🏛️ Interspeech
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the performance imbalance between high- and low-resource languages in multilingual automatic speech recognition (ASR), this paper proposes a hierarchical information fusion architecture. The method integrates, from bottom to top: language identification prediction, unsupervised acoustic unit discovery, cross-lingual phoneme sharing, and Mixture-of-Experts (MoE)-driven expert routing for token recognition. This design establishes the first multi-granularity language-acoustic co-modeling pathway, unifying shallow language-aware representation with deep acoustic generalization. Evaluated on joint training across 10 languages from Common Voice, the model maintains high-resource language performance while reducing average word error rate (WER) on low-resource languages by 12.3%, significantly mitigating resource-dependency bias. The core contribution is a scalable hierarchical representation pathway that offers a novel, generalizable yet adaptable paradigm for low-resource ASR.

📝 Abstract
Toward high-performance multilingual automatic speech recognition (ASR), various types of linguistic information and model designs have independently demonstrated their effectiveness. They include language identity (LID), phoneme information, language-specific processing modules, and cross-lingual self-supervised speech representation. Leveraging their benefits synergistically in a unified solution is expected to further improve overall system performance. This paper presents a novel design of a hierarchical information path, named LUPET, which sequentially encodes, from shallow layers to deep layers, multiple aspects of linguistic and acoustic information at diverse granularity scales. The path starts from LID prediction, followed by acoustic unit discovery, phoneme sharing, and finally token recognition routed by a mixture-of-experts. ASR experiments are carried out on 10 languages in the Common Voice corpus. The results demonstrate the superior performance of LUPET compared to the baseline systems. Most importantly, LUPET effectively mitigates the performance compromise between high-resource and low-resource languages in the multilingual setting.
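The final stage of the path, token recognition routed by a mixture-of-experts, can be illustrated with a minimal sketch: a gate produces soft weights over per-language experts, and the frame representation is the weighted mix of expert outputs. All names here (`route_to_expert`, the toy experts) are illustrative assumptions, not the paper's actual implementation.

```python
import math

def softmax(xs):
    # Numerically stable softmax over gate logits.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def route_to_expert(gate_logits, frame, experts):
    """Mix per-language expert outputs for one frame by the gate's soft weights."""
    weights = softmax(gate_logits)
    outputs = [expert(frame) for expert in experts]
    return [sum(w * o[i] for w, o in zip(weights, outputs))
            for i in range(len(outputs[0]))]

# Two toy "experts" on a 2-dimensional frame: identity and doubling.
experts = [lambda f: list(f), lambda f: [2 * x for x in f]]
mixed = route_to_expert([0.0, 0.0], [1.0, 3.0], experts)
# Equal gate logits -> equal weights -> average of the two experts: [1.5, 4.5]
```

In practice the gate would be conditioned on the LID prediction from the shallow layers, so routing reflects the language identity established earlier in the path.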
Problem

Research questions and friction points this paper is trying to address.

Multilingual Speech Recognition
Low-resource Languages
Performance Improvement
Innovation

Methods, ideas, or system contributions that make the work stand out.

LUPET
self-supervised learning
multilingual speech recognition
Wei Liu
Department of Electronic Engineering, The Chinese University of Hong Kong
Jingyong Hou
GVoice, Tencent
Dong Yang
GVoice, Tencent
Muyong Cao
GVoice, Tencent
Tan Lee
Department of Electronic Engineering, The Chinese University of Hong Kong