Language Model Integration Based on Memory Control for Sequence to Sequence Speech Recognition

📅 2018-11-06
🏛️ IEEE International Conference on Acoustics, Speech, and Signal Processing
📈 Citations: 5
Influential: 0
🤖 AI Summary
To address the challenge of deeply integrating a pre-trained language model (LM) with the sequence-to-sequence decoder in end-to-end speech recognition, this paper proposes a memory-cell control LM fusion method. The external LM's output dynamically modulates the memory cell state of the decoder LSTM, so the LM adjusts what the decoder retains rather than only rescoring its outputs. Unlike conventional shallow fusion or two-stage transfer approaches, the proposed method works with multi-level decoding and with both mono-lingual training and multilingual transfer learning. On LibriSpeech, it achieves relative WER reductions of 3.7% on test-clean and 2.4% on test-other over a shallow fusion baseline. In low-resource Swahili transfer experiments, it improves CER and WER by 9.9% and 9.8% relative, respectively, over the two-stage transfer baseline, indicating stronger cross-lingual generalization.
📝 Abstract
In this paper, we explore several new schemes to train a seq2seq model to integrate a pre-trained language model (LM). Our proposed fusion methods focus on the memory cell state and the hidden state in the seq2seq decoder long short-term memory (LSTM), and unlike prior studies, the memory cell state is updated by the LM. This means the memory retained by the main seq2seq model is adjusted by the external LM. These fusion methods have several variants depending on the architecture of this memory cell update and on the use of the memory cell and hidden states, which directly affects the final label inference. We performed experiments to show the effectiveness of the proposed methods in a mono-lingual ASR setup on the Librispeech corpus and in a transfer learning setup from a multilingual ASR (MLASR) base model to a low-resource language. On Librispeech, our best model improved WER by 3.7% and 2.4% relative on test-clean and test-other, respectively, over the shallow fusion baseline with multi-level decoding. In transfer learning from an MLASR base model to the IARPA Babel Swahili model, the best scheme improved the transferred model on the eval set by 9.9% and 9.8% relative in CER and WER over the 2-stage transfer baseline.
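The mechanism above is described only in words, so here is a minimal PyTorch sketch of one way an LM-controlled memory cell update could look. It is an illustration under assumptions, not the paper's exact formulation: the module name CellControlFusion, the dimensions, and the additive gated update are invented for this sketch, and the paper itself explores several variants of how the updated cell and hidden states feed label inference.

```python
# Illustrative sketch only: an external LM state gates an update to the decoder
# LSTM's memory cell, so the LM adjusts what the decoder retains. Names and the
# exact gating form are assumptions, not the paper's equations.
import torch
import torch.nn as nn


class CellControlFusion(nn.Module):
    def __init__(self, dec_dim: int, lm_dim: int):
        super().__init__()
        self.lm_proj = nn.Linear(lm_dim, dec_dim)          # project LM state into decoder space
        self.gate = nn.Linear(dec_dim + lm_dim, dec_dim)   # gate conditioned on cell and LM state

    def forward(self, c_dec: torch.Tensor, h_lm: torch.Tensor) -> torch.Tensor:
        # Per-dimension gate deciding how strongly the LM adjusts the retained memory.
        g = torch.sigmoid(self.gate(torch.cat([c_dec, h_lm], dim=-1)))
        return c_dec + g * torch.tanh(self.lm_proj(h_lm))


# One decoder step: run the LSTM cell, then let the LM revise its memory cell.
# In this sketch the revised cell only influences subsequent steps; the paper's
# variants differ in how the updated cell and hidden state reach the output layer.
dec_dim, lm_dim, vocab = 320, 650, 5000                    # assumed sizes
lstm = nn.LSTMCell(input_size=dec_dim, hidden_size=dec_dim)
fusion = CellControlFusion(dec_dim, lm_dim)
out_layer = nn.Linear(dec_dim, vocab)

x_t = torch.randn(1, dec_dim)                              # label embedding + attention context (assumed)
h, c = torch.zeros(1, dec_dim), torch.zeros(1, dec_dim)
h_lm = torch.randn(1, lm_dim)                              # pre-trained LM hidden state at this step
h, c = lstm(x_t, (h, c))
c = fusion(c, h_lm)                                        # LM-controlled memory cell update
logits = out_layer(h)                                      # next-label scores
```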
Problem

Research questions and friction points this paper is trying to address.

Integrate a pre-trained language model into seq2seq speech recognition.
Update the decoder's memory cell state using the external language model, not just its output scores.
Enhance ASR performance in mono-lingual and low-resource transfer setups.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Integrates a pre-trained LM deeply into the seq2seq decoder
Updates the decoder memory cell state with the external LM, rather than only interpolating output scores as in shallow fusion (see the sketch below)
Improves WER and CER in mono-lingual and cross-lingual transfer ASR setups
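For contrast, the shallow fusion baseline referenced in the abstract only interpolates output scores at decoding time and leaves the decoder's internal states untouched. A minimal sketch of that scoring rule (the weight value and tensor shapes are assumptions):

```python
import torch


def shallow_fusion_score(log_p_s2s: torch.Tensor,
                         log_p_lm: torch.Tensor,
                         lm_weight: float = 0.3) -> torch.Tensor:
    """Log-linear interpolation of seq2seq and LM label log-probabilities.

    Unlike the proposed memory-control fusion, the LM here only re-ranks
    candidate labels during beam search; the decoder LSTM's memory cell
    and hidden state are never modified.
    """
    return log_p_s2s + lm_weight * log_p_lm
```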
Authors

Jaejin Cho
Johns Hopkins University
Shinji Watanabe
Carnegie Mellon University
Speech recognition, Speech processing, Speech enhancement, Speech translation
Takaaki Hori
Apple
Speech Recognition, Spoken Language Processing, Machine Learning
M. Baskar
Brno University of Technology
H. Inaguma
Kyoto University
J. Villalba
Johns Hopkins University
N. Dehak
Johns Hopkins University