Exploring SSL Discrete Speech Features for Zipformer-based Contextual ASR

📅 2024-09-13
🏛️ arXiv.org
📈 Citations: 1
Influential: 0
🤖 AI Summary
This work investigates the effectiveness of self-supervised discrete speech tokens (WavLM tokens) for modeling both intra-sentence and cross-sentence contextual information—specifically preceding and succeeding utterances—in a Zipformer-Transducer end-to-end ASR system. Unlike conventional Fbank features, we systematically integrate discrete acoustic tokens as explicit cross-sentence context into the Zipformer encoder, enabling joint modeling of intra- and inter-utterance representations. Evaluated on Gigaspeech, our approach achieves WERs of 11.15% (dev) and 11.14% (test), yielding absolute improvements of 0.32–0.41% (relative gains of 2.78–3.54%) over a baseline using only intra-sentence context, establishing a new state-of-the-art among publicly reported results. Key contributions include: (i) empirical validation of discrete speech tokens as effective cross-sentence acoustic context; (ii) a scalable context fusion mechanism that supports flexible integration of multi-utterance information; and (iii) the first demonstrated significant WER reduction using self-supervised learning features within the Zipformer-Transducer architecture.
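The context fusion described above can be pictured as prepending and appending embedded discrete-token context to the current utterance's features before the encoder. The following is a minimal illustrative sketch, not the paper's actual implementation; the codebook size, embedding dimension, and all function names are assumptions.

```python
import numpy as np

VOCAB = 500  # assumed size of the WavLM k-means token codebook
DIM = 80     # feature dimension (chosen to match Fbank bins for simplicity)

rng = np.random.default_rng(0)
# In a real model this lookup table would be learned jointly with the encoder.
token_embedding = rng.standard_normal((VOCAB, DIM))

def embed_tokens(token_ids):
    """Map discrete WavLM token ids to dense embeddings: (T,) -> (T, DIM)."""
    return token_embedding[np.asarray(token_ids)]

def fuse_context(prev_tokens, curr_feats, next_tokens):
    """Concatenate embedded tokens from the preceding and succeeding
    utterances around the current utterance's features along the time axis,
    so the encoder sees intra- and cross-utterance context jointly."""
    parts = [embed_tokens(prev_tokens), curr_feats, embed_tokens(next_tokens)]
    return np.concatenate(parts, axis=0)

prev = [3, 17, 42]                      # tokens from the preceding utterance
nxt = [7, 7, 99, 120]                   # tokens from the succeeding utterance
curr = rng.standard_normal((50, DIM))   # current utterance's frame features

fused = fuse_context(prev, curr, nxt)
print(fused.shape)  # (57, 80): 3 context + 50 current + 4 context frames
```

A real system would also mark the context/current boundary (e.g. with separator embeddings or positional information) rather than relying on raw concatenation alone.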

📝 Abstract
Self-supervised learning (SSL) based discrete speech representations are highly compact and domain adaptable. In this paper, SSL discrete speech features extracted from WavLM models are used as additional cross-utterance acoustic context features in Zipformer-Transducer ASR systems. The efficacy of replacing Fbank features with discrete token features for modelling cross-utterance contexts (from preceding and future segments), the current utterance's internal context alone, or both at the same time is demonstrated thoroughly on the Gigaspeech 1000-hour corpus. The best Zipformer-Transducer system using discrete-token-based cross-utterance context features outperforms the baseline using utterance-internal context only, with statistically significant word error rate (WER) reductions of 0.32% to 0.41% absolute (2.78% to 3.54% relative) on the dev and test data. The lowest published WERs of 11.15% and 11.14% were obtained on the dev and test sets, respectively. Our work is open-source and publicly available at https://github.com/open-creator/icefall/tree/master/egs/gigaspeech/Context_ASR.
Problem

Research questions and friction points this paper is trying to address.

Exploring SSL discrete features for contextual ASR
Comparing Fbank and discrete tokens in Zipformer-Transducer
Improving WER with cross-utterance acoustic context
Innovation

Methods, ideas, or system contributions that make the work stand out.

SSL discrete speech features enhance Zipformer-Transducer ASR
WavLM models provide cross-utterance acoustic context features
Discrete tokens replace Fbank for contextual modeling
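The last point above rests on discretizing continuous SSL features into tokens, typically via a k-means codebook. A toy sketch of that step follows; the codebook size, feature dimension, and data here are made up for illustration and do not reflect the paper's configuration.

```python
import numpy as np

rng = np.random.default_rng(1)
# A tiny stand-in codebook: 8 centroids over 4-dim features. A real setup
# would fit k-means on SSL (e.g. WavLM) hidden states with a far larger k.
codebook = rng.standard_normal((8, 4))

def discretize(features):
    """Assign each frame to its nearest centroid: (T, 4) -> (T,) token ids."""
    # Squared Euclidean distance from every frame to every centroid.
    d = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return d.argmin(axis=1)

feats = rng.standard_normal((10, 4))  # 10 frames of continuous SSL features
tokens = discretize(feats)
print(tokens.shape)  # (10,): one discrete token per frame
```

The resulting token sequence is far more compact than the continuous features, which is what makes it practical to carry whole neighboring utterances as context.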