M-CIF: Multi-Scale Alignment For CIF-Based Non-Autoregressive ASR

2025-10-25
Citations: 0
Influential: 0
AI Summary
To address unstable acoustic-text alignment in non-autoregressive speech recognition based on the Continuous Integrate-and-Fire (CIF) mechanism, particularly for languages such as English and French, this paper proposes Multi-scale CIF (M-CIF). M-CIF adds progressive supervision at the phoneme and character levels within the CIF framework, giving multi-granularity alignment, and distills that supervision into subword representations to make alignment more robust. Experiments show that M-CIF reduces WER compared to the Paraformer baseline, with reductions on CommonVoice of 4.21% for German and 3.05% for French. Analysis with two error metrics, phonetic confusion errors (PE) and space-related segmentation errors (SE), indicates that the phoneme and character layers are essential to the progressive alignment. According to this summary, it is the first work to incorporate explicit multi-scale supervision into the CIF framework.
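For readers unfamiliar with the metric behind the reported numbers, the following is a minimal sketch of word error rate (WER) as it is commonly defined: word-level edit distance divided by the number of reference words. The function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration; the page does not describe the paper's scoring setup, nor whether the reported reductions are absolute or relative.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Illustrative helper only: tokenization is a plain whitespace split,
    which is an assumption and not the paper's evaluation pipeline.
    """
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution ("sat" -> "sit") and one deletion ("the").
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

An absolute reduction subtracts two such rates directly, while a relative reduction divides that difference by the baseline rate; the two readings of a figure like 4.21% can differ considerably.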

Abstract
The Continuous Integrate-and-Fire (CIF) mechanism provides effective alignment for non-autoregressive (NAR) speech recognition. This mechanism creates a smooth and monotonic mapping from acoustic features to target tokens, achieving performance on Mandarin competitive with other NAR approaches. However, without finer-grained guidance, its stability degrades in some languages such as English and French. In this paper, we propose Multi-scale CIF (M-CIF), which performs multi-level alignment by integrating character and phoneme level supervision progressively distilled into subword representations, thereby enhancing robust acoustic-text alignment. Experiments show that M-CIF reduces WER compared to the Paraformer baseline, especially on CommonVoice by 4.21% in German and 3.05% in French. To further investigate these gains, we define phonetic confusion errors (PE) and space-related segmentation errors (SE) as evaluation metrics. Analysis of these metrics across different M-CIF settings reveals that the phoneme and character layers are essential for enhancing progressive CIF alignment.
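The monotonic mapping the abstract attributes to CIF can be illustrated with a small sketch of the basic integrate-and-fire loop: per-frame weights are accumulated until they reach a threshold, at which point one token-level vector is emitted and the leftover weight starts the next token. This is a simplified NumPy illustration of plain CIF, not the M-CIF model; the function name `cif`, the threshold of 1.0, and the requirement that each weight not exceed the threshold are assumptions made for the example.

```python
import numpy as np


def cif(hidden, alpha, threshold=1.0):
    """Simplified integrate-and-fire over a sequence of frames.

    hidden: (T, D) frame-level features (illustrative input).
    alpha:  (T,) non-negative weights, each assumed <= threshold so that
            at most one token fires per frame.
    Returns an (N, D) array with one row per fired token.
    """
    hidden = np.asarray(hidden, dtype=float)
    alpha = np.asarray(alpha, dtype=float)
    dim = hidden.shape[1]

    acc_weight = 0.0           # weight integrated for the current token
    acc_state = np.zeros(dim)  # weighted sum of frames for the current token
    fired = []

    for h_t, a_t in zip(hidden, alpha):
        if acc_weight + a_t < threshold:
            # Keep integrating: the token boundary is not reached yet.
            acc_weight += a_t
            acc_state = acc_state + a_t * h_t
        else:
            # Fire: spend only the weight needed to reach the threshold,
            # and carry the remainder over to the next token.
            used = threshold - acc_weight
            fired.append(acc_state + used * h_t)
            acc_weight = a_t - used
            acc_state = acc_weight * h_t

    return np.array(fired).reshape(-1, dim)


if __name__ == "__main__":
    frames = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
    weights = [0.5, 0.5, 0.5, 0.5]
    print(cif(frames, weights))  # two frames are integrated per token
```

Because frames are consumed strictly left to right and the number of fired tokens is governed by the accumulated weight, the alignment is monotonic by construction. Any weight left over at the end of the sequence is simply discarded in this sketch.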
Problem

Research questions and friction points this paper is trying to address.

Improves alignment stability for non-autoregressive speech recognition
Enhances acoustic-text alignment using multi-level (character and phoneme) supervision
Reduces phonetic confusion and segmentation errors in multilingual ASR
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-scale alignment integrating character and phoneme supervision
Progressive distillation into subword representations
More robust acoustic-text alignment for improved stability

Ruixiang Mao
School of Computer Science and Engineering, Northeastern University, Shenyang, China
Xiangnan Ma
School of Computer Science and Engineering, Northeastern University, Shenyang, China
Qing Yang
School of Computer Science and Engineering, Northeastern University, Shenyang, China
Ziming Zhu
School of Computer Science and Engineering, Northeastern University, Shenyang, China
Yucheng Qiao
School of Computer Science and Engineering, Northeastern University, Shenyang, China
Yuan Ge
Northeastern University, China
Reasoning, Multimodality LLMs
Tong Xiao
School of Computer Science and Engineering, Northeastern University, Shenyang, China; NiuTrans Research
Shengxiang Gao
Kunming University of Science and Technology, China
Zhengtao Yu
Kunming University of Science and Technology
Jingbo Zhu
Northeastern University, China
Machine Translation, Language Parsing, Natural Language Processing