CR-CTC: Consistency regularization on CTC for improved speech recognition

📅 2024-10-07
🏛️ arXiv.org
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
CTC, while computationally efficient for automatic speech recognition (ASR), suffers from suboptimal performance due to its extremely peaky output distribution and limited contextual modeling. To address this, we propose Consistency-Regularized CTC (CR-CTC), the first method to introduce self-consistency regularization into the CTC framework. CR-CTC generates two augmented views of the input mel-spectrogram, applies independent CTC modeling to each view, and enforces consistency between their output distributions via a KL-divergence constraint. This yields implicit self-distillation and masked-prediction-style contextual representation learning, effectively mitigating CTC's peaky-output problem. Crucially, CR-CTC requires no architectural modifications or auxiliary decoders, only a redesigned loss function. Evaluated on LibriSpeech, AISHELL-1, and GigaSpeech, CR-CTC consistently outperforms standard CTC, matches or exceeds the accuracy of transducer and CTC/attention hybrid systems, and achieves state-of-the-art results across benchmarks.

๐Ÿ“ Abstract
Connectionist Temporal Classification (CTC) is a widely used method for automatic speech recognition (ASR), renowned for its simplicity and computational efficiency. However, it often falls short in recognition performance. In this work, we propose the Consistency-Regularized CTC (CR-CTC), which enforces consistency between two CTC distributions obtained from different augmented views of the input speech mel-spectrogram. We provide in-depth insights into its essential behaviors from three perspectives: 1) it conducts self-distillation between random pairs of sub-models that process different augmented views; 2) it learns contextual representation through masked prediction for positions within time-masked regions, especially when we increase the amount of time masking; 3) it suppresses the extremely peaky CTC distributions, thereby reducing overfitting and improving the generalization ability. Extensive experiments on LibriSpeech, Aishell-1, and GigaSpeech datasets demonstrate the effectiveness of our CR-CTC. It significantly improves the CTC performance, achieving state-of-the-art results comparable to those attained by transducer or systems combining CTC and attention-based encoder-decoder (CTC/AED). We release our code at https://github.com/k2-fsa/icefall.
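The abstract describes the training objective as two CTC losses (one per augmented view) plus a consistency term between the two per-frame output distributions. A minimal NumPy sketch of that combination is below; the function names, the symmetric form of the KL term, and the weight `alpha` are illustrative assumptions, not the authors' exact formulation (the official implementation is in the icefall repository linked above).

```python
import numpy as np

def log_softmax(x, axis=-1):
    """Numerically stable log-softmax over the vocabulary axis."""
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def consistency_loss(logits_a, logits_b):
    """Symmetric KL divergence between two per-frame CTC posteriors.

    logits_a, logits_b: (T, V) arrays of frame-level logits over the
    CTC vocabulary (including blank), one per augmented view.
    """
    lp_a, lp_b = log_softmax(logits_a), log_softmax(logits_b)
    p_a, p_b = np.exp(lp_a), np.exp(lp_b)
    kl_ab = (p_a * (lp_a - lp_b)).sum(axis=-1).mean()  # KL(p_a || p_b)
    kl_ba = (p_b * (lp_b - lp_a)).sum(axis=-1).mean()  # KL(p_b || p_a)
    return 0.5 * (kl_ab + kl_ba)

def cr_ctc_loss(ctc_a, ctc_b, logits_a, logits_b, alpha=0.2):
    """Average the two branch CTC losses and add the weighted
    consistency term; alpha is a hypothetical trade-off weight."""
    return 0.5 * (ctc_a + ctc_b) + alpha * consistency_loss(logits_a, logits_b)
```

Identical views give a zero consistency term, so the objective reduces to plain CTC; the further the two branches' posteriors drift apart (e.g. under heavier time masking), the larger the regularization penalty.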
Problem

Research questions and friction points this paper is trying to address.

Improves CTC speech recognition
Enhances contextual representation learning
Reduces overfitting in CTC distributions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Self-distillation between sub-models
Contextual learning via masked prediction
Suppression of peaky CTC distributions