UMA-Split: unimodal aggregation for both English and Mandarin non-autoregressive speech recognition

📅 2025-09-18
🤖 AI Summary
Existing unimodal aggregation (UMA)-based non-autoregressive speech recognition models degrade on English because its tokenization is fine-grained: a single syllable may be split into several tokens, and a token may span fewer than three acoustic frames, which is too few to form unimodal weights. Method: UMA-Split is a CTC-based framework that lets each UMA-aggregated frame map to multiple text tokens. A lightweight split module generates two tokens from each aggregated frame before the CTC loss is computed. Contribution/Results: UMA-Split applies one modeling approach to both English and Mandarin. Experiments show consistent gains over standard CTC and the original UMA on datasets in both languages, indicating that the split module mitigates the fine-grained alignment problem.

📝 Abstract
This paper proposes a unimodal aggregation (UMA) based non-autoregressive model for both English and Mandarin speech recognition. The original UMA explicitly segments and aggregates the acoustic frames of the same text token (with unimodal weights that first monotonically increase and then decrease) to learn better representations than regular connectionist temporal classification (CTC). However, it only works well for Mandarin. It struggles with other languages, such as English, in which a single syllable may be tokenized into multiple fine-grained tokens, or a token may span fewer than 3 acoustic frames and fail to form unimodal weights. To address this problem, we propose allowing each UMA-aggregated frame to map to multiple tokens, via a simple split module that generates two tokens from each aggregated frame before computing the CTC loss.
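The segment-and-aggregate step described in the abstract can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: it assumes segment boundaries are placed at local minima ("valleys") of the per-frame weights and that each segment is reduced to a weighted average of its frames. The function names `valley_starts` and `aggregate`, and the toy weights, are hypothetical.

```python
from typing import List, Sequence


def valley_starts(weights: Sequence[float]) -> List[int]:
    """Return the start index of each segment.

    Assumption: a new segment starts at every interior local minimum
    (valley) of the weights, so each segment rises and then falls.
    """
    if not weights:
        return []
    starts = [0]
    for t in range(1, len(weights) - 1):
        # Strict on the left, non-strict on the right, so a flat valley
        # produces only one boundary.
        if weights[t - 1] > weights[t] <= weights[t + 1]:
            starts.append(t)
    return starts


def aggregate(frames: Sequence[Sequence[float]],
              weights: Sequence[float]) -> List[List[float]]:
    """Weighted-average the frames of each segment into one vector.

    Assumption: weights are strictly positive (e.g. sigmoid outputs),
    so the normalizer of every segment is non-zero.
    """
    starts = valley_starts(weights)
    ends = starts[1:] + [len(weights)]
    dim = len(frames[0]) if frames else 0
    aggregated = []
    for s, e in zip(starts, ends):
        total = sum(weights[s:e])
        aggregated.append([
            sum(weights[t] * frames[t][d] for t in range(s, e)) / total
            for d in range(dim)
        ])
    return aggregated


# Toy input: seven 1-dimensional frames with two weight "bumps".
toy_weights = [0.1, 0.8, 0.3, 0.1, 0.6, 0.9, 0.2]
toy_frames = [[1.0]] * 7
toy_aggregated = aggregate(toy_frames, toy_weights)
```

With these toy weights the only interior valley is at index 3, so the seven frames collapse into two aggregated frames. Real UMA weights come from the encoder, which is outside the scope of this sketch.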
Problem

Research questions and friction points this paper is trying to address.

Improves cross-lingual speech recognition accuracy
Enables unimodal aggregation for fine-grained tokens
Solves unimodal weight formation in short token spans
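To make the short-span issue concrete, the check below tests whether a run of weights has the rise-then-fall shape the abstract describes. It is an illustrative sketch under one reading of "unimodal" (strictly increasing to a single interior peak, then strictly decreasing); the helper name `has_interior_peak` is hypothetical. Under that reading, any span shorter than 3 frames fails by construction, since a peak needs a lower neighbor on each side.

```python
from typing import Sequence


def has_interior_peak(weights: Sequence[float]) -> bool:
    """True if weights strictly rise to one interior peak, then strictly fall."""
    n = len(weights)
    if n < 3:
        # One or two frames cannot both rise and fall around a peak.
        return False
    peak = max(range(n), key=lambda i: weights[i])
    if peak == 0 or peak == n - 1:
        # The maximum sits on an edge, so one side of the bump is missing.
        return False
    rising = all(weights[i] < weights[i + 1] for i in range(peak))
    falling = all(weights[i] > weights[i + 1] for i in range(peak, n - 1))
    return rising and falling


three_frame_span = has_interior_peak([0.2, 0.9, 0.3])
two_frame_span = has_interior_peak([0.4, 0.9])
```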
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unimodal aggregation for multilingual speech recognition
Split module mapping frames to multiple tokens
CTC loss computation with enhanced token generation
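The split idea can be sketched as follows. The source states only that a simple split module generates two tokens from each aggregated frame before the CTC loss. The two projection matrices used here (`w_first`, `w_second`) and the example tokenization are hypothetical stand-ins, not the paper's architecture. The helper `ctc_min_steps` relies on a standard CTC property: the input must have at least as many steps as the target has tokens, plus one extra step for each pair of adjacent identical tokens.

```python
from typing import List, Sequence


def linear(x: Sequence[float], weight: Sequence[Sequence[float]]) -> List[float]:
    """Multiply a row vector x (length d_in) by weight (d_in x d_out)."""
    d_out = len(weight[0])
    return [sum(x[i] * weight[i][j] for i in range(len(x))) for j in range(d_out)]


def split_frames(aggregated: Sequence[Sequence[float]],
                 w_first: Sequence[Sequence[float]],
                 w_second: Sequence[Sequence[float]]) -> List[List[float]]:
    """Emit two output vectors per aggregated frame, interleaved in time.

    Assumption: the split is modeled as two independent linear maps.
    """
    outputs = []
    for frame in aggregated:
        outputs.append(linear(frame, w_first))
        outputs.append(linear(frame, w_second))
    return outputs


def ctc_min_steps(target: Sequence[str]) -> int:
    """Minimum number of input steps CTC needs to emit the target."""
    repeats = sum(1 for a, b in zip(target, target[1:]) if a == b)
    return len(target) + repeats


# Three aggregated frames, but a hypothetical five-token target.
aggregated = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
swap = [[0.0, 1.0], [1.0, 0.0]]
target = ["un", "der", "stand", "a", "ble"]

split_out = split_frames(aggregated, identity, swap)
feasible_without_split = len(aggregated) >= ctc_min_steps(target)
feasible_with_split = len(split_out) >= ctc_min_steps(target)
```

In this toy case three aggregated frames are too few for five tokens, while the six split outputs are enough. Slots that are not needed can be filled by the CTC blank.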
Ying Fang
Westlake University; Zhejiang University
speech recognition
Xiaofei Li
School of Engineering, Westlake University, China; Institute of Advanced Technology, Westlake Institute for Advanced Study, China