🤖 AI Summary
Existing unimodal aggregation (UMA) based non-autoregressive speech recognition models degrade on English because tokenization there is overly fine-grained: a single syllable may split into multiple tokens, so a token can align to fewer than three acoustic frames, which is too few to form the unimodal weights the method relies on.
Method: We propose UMA-Split, a CTC-based framework with a splittable unimodal aggregation mechanism that permits one aggregated acoustic frame to map to multiple text tokens, relaxing UMA's one-frame-one-token constraint. A lightweight split module generates two tokens from each aggregated frame before the CTC loss is computed.
Contribution/Results: UMA-Split unifies modeling across English and Mandarin, improving cross-lingual generalization. Experiments show consistent gains over standard CTC and the original UMA on both English and Mandarin datasets, validating its effectiveness in mitigating fine-grained alignment issues.
📝 Abstract
This paper proposes a unimodal aggregation (UMA) based non-autoregressive model for both English and Mandarin speech recognition. The original UMA explicitly segments and aggregates the acoustic frames of the same text token (with unimodal weights that first monotonically increase and then decrease) to learn better representations than regular connectionist temporal classification (CTC). However, it only works well for Mandarin. It struggles with other languages, such as English, for which a single syllable may be tokenized into multiple fine-grained tokens, or a token may span fewer than 3 acoustic frames and thus fail to form unimodal weights. To address this problem, we propose allowing each UMA-aggregated frame to map to multiple tokens, via a simple split module that generates two tokens from each aggregated frame before computing the CTC loss.
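The aggregate-then-split pipeline described above can be sketched roughly as follows. This is a hypothetical NumPy simplification, not the authors' implementation: segment boundaries are found here by naive valley detection on scalar weights, and the split module is stood in for by two fixed projection matrices (`W1`, `W2`) where the paper would use learned layers.

```python
import numpy as np

def unimodal_aggregate(frames, weights, eps=1e-8):
    """Aggregate consecutive acoustic frames, one output vector per segment.

    Segment boundaries are placed at local minima ("valleys") of the scalar
    weights, so within each segment the weights first rise then fall
    (unimodal). frames: (T, D), weights: (T,). Simplified sketch of UMA.
    """
    boundaries = [0]
    for t in range(1, len(weights) - 1):
        if weights[t] <= weights[t - 1] and weights[t] <= weights[t + 1]:
            boundaries.append(t)
    boundaries.append(len(weights))
    segments = []
    for s, e in zip(boundaries[:-1], boundaries[1:]):
        w = weights[s:e]
        # weighted average of the frames belonging to this segment
        segments.append((w[:, None] * frames[s:e]).sum(0) / (w.sum() + eps))
    return np.stack(segments)

def split_module(aggregated, W1, W2):
    """Map each aggregated frame to TWO token logit vectors (the 'split').

    W1, W2: (D, V) projections standing in for learned layers. The two
    outputs per frame are interleaved, doubling the sequence length that
    is then fed to the CTC loss.
    """
    a = aggregated @ W1  # first token candidate per aggregated frame
    b = aggregated @ W2  # second token candidate per aggregated frame
    out = np.empty((2 * aggregated.shape[0], W1.shape[1]))
    out[0::2], out[1::2] = a, b  # frame i -> positions 2i and 2i+1
    return out
```

In this sketch a token that would be too short to form its own unimodal segment can still be emitted, because each aggregated frame now yields two output positions rather than one.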