🤖 AI Summary
Existing unimodal aggregation (UMA) based non-autoregressive speech recognition models degrade on English because tokenization there is overly fine-grained: a single syllable may split into multiple tokens, so a token can align to fewer than three acoustic frames, which is too few to form the unimodal weights the method relies on.
Method: We propose UMA-Split, a CTC-based framework with a splittable unimodal aggregation mechanism that permits one aggregated acoustic frame to map to multiple text tokens, relaxing UMA's one-frame-one-token constraint. A lightweight split module generates two tokens from each aggregated frame before the CTC loss is computed.
Contribution/Results: UMA-Split unifies modeling across English and Mandarin, improving cross-lingual generalization. Experiments show consistent gains over standard CTC and the original UMA on both English and Mandarin datasets, validating its effectiveness in mitigating fine-grained alignment issues.
📝 Abstract
This paper proposes a unimodal aggregation (UMA) based non-autoregressive model for both English and Mandarin speech recognition. The original UMA explicitly segments and aggregates the acoustic frames of the same text token (with unimodal weights that first monotonically increase and then decrease) to learn better representations than regular connectionist temporal classification (CTC). However, it only works well for Mandarin. It struggles with other languages, such as English, for which a single syllable may be tokenized into multiple fine-grained tokens, or a token may span fewer than 3 acoustic frames and thus fail to form unimodal weights. To address this problem, we propose allowing each UMA-aggregated frame to map to multiple tokens, via a simple split module that generates two tokens from each aggregated frame before computing the CTC loss.
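The aggregate-then-split pipeline described above can be sketched roughly as follows. This is a hypothetical NumPy simplification, not the authors' implementation: segment boundaries are found here by naive valley detection on scalar weights, and the split module is stood in for by two fixed projection matrices (`W1`, `W2`) where the paper would use learned layers.

```python
import numpy as np

def unimodal_aggregate(frames, weights, eps=1e-8):
    """Aggregate consecutive acoustic frames, one output vector per segment.

    Segment boundaries are placed at local minima ("valleys") of the scalar
    weights, so within each segment the weights first rise then fall
    (unimodal). frames: (T, D), weights: (T,). Simplified sketch of UMA.
    """
    boundaries = [0]
    for t in range(1, len(weights) - 1):
        if weights[t] <= weights[t - 1] and weights[t] <= weights[t + 1]:
            boundaries.append(t)
    boundaries.append(len(weights))
    segments = []
    for s, e in zip(boundaries[:-1], boundaries[1:]):
        w = weights[s:e]
        # weighted average of the frames belonging to this segment
        segments.append((w[:, None] * frames[s:e]).sum(0) / (w.sum() + eps))
    return np.stack(segments)

def split_module(aggregated, W1, W2):
    """Map each aggregated frame to TWO token logit vectors (the 'split').

    W1, W2: (D, V) projections standing in for learned layers. The two
    outputs per frame are interleaved, doubling the sequence length that
    is then fed to the CTC loss.
    """
    a = aggregated @ W1  # first token candidate per aggregated frame
    b = aggregated @ W2  # second token candidate per aggregated frame
    out = np.empty((2 * aggregated.shape[0], W1.shape[1]))
    out[0::2], out[1::2] = a, b  # frame i -> positions 2i and 2i+1
    return out
```

In this sketch a token that would be too short to form its own unimodal segment can still be emitted, because each aggregated frame now yields two output positions rather than one.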