CR-CTC: Consistency regularization on CTC for improved speech recognition

📅 2024-10-07
🏛️ arXiv.org
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
CTC, while computationally efficient for automatic speech recognition (ASR), suffers from suboptimal performance due to its extremely peaky output distribution and limited contextual modeling. To address this, we propose Consistency-Regularized CTC (CR-CTC), the first method to introduce self-consistency regularization into the CTC framework. CR-CTC generates two augmented views of the input mel-spectrogram, applies independent CTC modeling to each view, and enforces consistency between their output distributions via a KL-divergence constraint. This yields implicit self-distillation and masked-prediction-style contextual representation learning, effectively mitigating CTC's peaky-output problem. Crucially, CR-CTC requires no architectural modifications or auxiliary decoders, only a redesigned loss function. Evaluated on LibriSpeech, AISHELL-1, and GigaSpeech, CR-CTC consistently outperforms standard CTC, matches or exceeds the accuracy of transducer and CTC/attention hybrid systems, and achieves state-of-the-art results across benchmarks.

๐Ÿ“ Abstract
Connectionist Temporal Classification (CTC) is a widely used method for automatic speech recognition (ASR), renowned for its simplicity and computational efficiency. However, it often falls short in recognition performance. In this work, we propose the Consistency-Regularized CTC (CR-CTC), which enforces consistency between two CTC distributions obtained from different augmented views of the input speech mel-spectrogram. We provide in-depth insights into its essential behaviors from three perspectives: 1) it conducts self-distillation between random pairs of sub-models that process different augmented views; 2) it learns contextual representation through masked prediction for positions within time-masked regions, especially when we increase the amount of time masking; 3) it suppresses the extremely peaky CTC distributions, thereby reducing overfitting and improving the generalization ability. Extensive experiments on LibriSpeech, Aishell-1, and GigaSpeech datasets demonstrate the effectiveness of our CR-CTC. It significantly improves the CTC performance, achieving state-of-the-art results comparable to those attained by transducer or systems combining CTC and attention-based encoder-decoder (CTC/AED). We release our code at https://github.com/k2-fsa/icefall.
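The abstract describes the training objective as two CTC losses (one per augmented view) plus a consistency term between the two per-frame output distributions. A minimal NumPy sketch of that combination is below; the function names, the symmetric form of the KL term, and the weight `alpha` are illustrative assumptions, not the authors' exact formulation (the official implementation is in the icefall repository linked above).

```python
import numpy as np

def log_softmax(x, axis=-1):
    """Numerically stable log-softmax over the vocabulary axis."""
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def consistency_loss(logits_a, logits_b):
    """Symmetric KL divergence between two per-frame CTC posteriors.

    logits_a, logits_b: (T, V) arrays of frame-level logits over the
    CTC vocabulary (including blank), one per augmented view.
    """
    lp_a, lp_b = log_softmax(logits_a), log_softmax(logits_b)
    p_a, p_b = np.exp(lp_a), np.exp(lp_b)
    kl_ab = (p_a * (lp_a - lp_b)).sum(axis=-1).mean()  # KL(p_a || p_b)
    kl_ba = (p_b * (lp_b - lp_a)).sum(axis=-1).mean()  # KL(p_b || p_a)
    return 0.5 * (kl_ab + kl_ba)

def cr_ctc_loss(ctc_a, ctc_b, logits_a, logits_b, alpha=0.2):
    """Average the two branch CTC losses and add the weighted
    consistency term; alpha is a hypothetical trade-off weight."""
    return 0.5 * (ctc_a + ctc_b) + alpha * consistency_loss(logits_a, logits_b)
```

Identical views give a zero consistency term, so the objective reduces to plain CTC; the further the two branches' posteriors drift apart (e.g. under heavier time masking), the larger the regularization penalty.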
Problem

Research questions and friction points this paper is trying to address.

Improves CTC speech recognition
Enhances contextual representation learning
Reduces overfitting in CTC distributions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Self-distillation between sub-models
Contextual learning via masked prediction
Suppression of peaky CTC distributions