Improving Transducer-Based Spoken Language Understanding with Self-Conditioned CTC and Knowledge Transfer

📅 2025-01-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address insufficient acoustic-semantic coupling and inefficient knowledge transfer in end-to-end spoken language understanding (SLU), this paper proposes a differentiable cascaded modeling paradigm within the RNN-Transducer (RNN-T) framework. The approach introduces three key innovations: (1) a self-conditioned CTC objective enabling joint, differentiable optimization of ASR and SLU; (2) a cross-modal alignment mechanism between acoustic embeddings and BERT-derived semantic representations, explicitly strengthening the SLU decoder's reliance on semantically enriched acoustic features; and (3) a bag-of-entities prediction layer facilitating fine-grained semantic knowledge transfer. Experiments demonstrate that the method achieves strong SLU performance while using approximately 60% fewer parameters than Whisper, outperforming multiple strong baselines across all evaluated metrics.

📝 Abstract
In this paper, we propose to improve end-to-end (E2E) spoken language understanding (SLU) in an RNN transducer model (RNN-T) by incorporating a joint self-conditioned CTC automatic speech recognition (ASR) objective. Our proposed model is akin to an E2E differentiable cascaded model which performs ASR and SLU sequentially, and we ensure that the SLU task is conditioned on the ASR task through CTC self-conditioning. This novel joint modeling of ASR and SLU improves SLU performance significantly over SLU-only optimization. We further improve performance by aligning the acoustic embeddings of this model with the semantically richer BERT model. Our proposed knowledge transfer strategy makes use of a bag-of-entities prediction layer on the aligned embeddings, and its output is used to condition the RNN-T based SLU decoding. These techniques show significant improvement over several strong baselines and can perform on par with large models like Whisper while using significantly fewer parameters.
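The core self-conditioning idea from the abstract — feeding intermediate CTC posteriors back into the encoder so later layers (and the downstream SLU task) are conditioned on the ASR hypothesis — can be sketched roughly as follows. This is a minimal illustrative sketch, not the paper's implementation: the dimensions, the weight matrices `W_ctc` and `W_back`, and the single residual projection are all assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Hypothetical sizes: T frames, encoder dim D, ASR vocab size V
T, D, V = 50, 256, 100
rng = np.random.default_rng(0)

# Acoustic features at an intermediate encoder layer
h_mid = rng.standard_normal((T, D))

# Intermediate CTC head: per-frame token posteriors (illustrative weights)
W_ctc = 0.01 * rng.standard_normal((D, V))
p_ctc = softmax(h_mid @ W_ctc)

# Self-conditioning: project the posteriors back into the encoder
# space and add them residually, so the remaining encoder layers
# see features conditioned on the intermediate ASR prediction
W_back = 0.01 * rng.standard_normal((V, D))
h_cond = h_mid + p_ctc @ W_back

assert h_cond.shape == (T, D)
```

In the full model, `h_cond` would continue through the remaining encoder layers, and the intermediate CTC losses would be added to the RNN-T SLU objective during training.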
Problem

Research questions and friction points this paper is trying to address.

Speech Understanding
Accuracy Improvement
Spoken Language Processing
Innovation

Methods, ideas, or system contributions that make the work stand out.

Self-conditioned CTC
Knowledge Transfer
Enhanced RNN-T Model