Joint Automatic Speech Recognition And Structure Learning For Better Speech Understanding

📅 2025-01-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing end-to-end spoken language understanding (SLU) approaches struggle to simultaneously achieve high-accuracy automatic speech recognition (ASR) and structured semantic parsing. To address this, we propose JSRSL—a Joint Speech Recognition and Structured Learning framework based on semantic spans—marking the first unified span-based end-to-end paradigm for ASR and structured learning. JSRSL employs multi-task learning with explicit speech–semantic alignment to jointly optimize ASR, named entity recognition (NER), and intent classification, balancing real-time inference and structural fidelity. Evaluated on the bilingual AISHELL-NER (Chinese) and SLURP (English) benchmarks, JSRSL achieves state-of-the-art performance across all three tasks, consistently outperforming conventional sequence-to-sequence baselines. It significantly improves both accuracy and structural consistency in end-to-end SLU, demonstrating superior generalizability and robustness in multilingual spoken language understanding.

📝 Abstract
Spoken language understanding (SLU) is a structure prediction task in the field of speech. Recently, many works that treat SLU as a sequence-to-sequence task have achieved great success. However, this approach is not well suited to simultaneous speech recognition and understanding. In this paper, we propose a joint speech recognition and structure learning framework (JSRSL), a span-based end-to-end SLU model that can accurately transcribe speech and extract structured content simultaneously. We conduct experiments on named entity recognition and intent classification using the Chinese dataset AISHELL-NER and the English dataset SLURP. The results show that our proposed method not only outperforms the traditional sequence-to-sequence method in both transcription and extraction capabilities but also achieves state-of-the-art performance on both datasets.
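The span-based idea described in the abstract can be illustrated with a minimal sketch: instead of generating a tagged sequence, the model predicts (start, end, label) spans over the ASR token sequence, so the transcript and the structured entities come from the same output. The helper below is a hypothetical decoder written for illustration, not the paper's actual implementation; the example tokens and span indices are invented.

```python
def extract_spans(tokens, span_preds):
    """Decode (start, end, label) span predictions over an ASR token
    sequence into surface-form entities.

    Hypothetical illustration of span-based extraction; JSRSL's real
    decoder and alignment are described in the paper, not here.
    """
    entities = []
    for start, end, label in span_preds:
        # Keep only spans that lie fully inside the transcript.
        if 0 <= start <= end < len(tokens):
            entities.append((label, "".join(tokens[start : end + 1])))
    return entities


# Character-level tokens, as in Chinese ASR (example input is invented).
transcript = list("我想去武汉大学")
# One predicted organization span covering "武汉大学".
spans = [(3, 6, "ORG")]
print(extract_spans(transcript, spans))  # → [('ORG', '武汉大学')]
```

Because the entities are read directly off the transcript tokens, transcription and extraction stay consistent by construction, which is the structural-fidelity property the summary highlights.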
Problem

Research questions and friction points this paper is trying to address.

Spoken Language Understanding
Structure Learning
Content Recognition
Innovation

Methods, ideas, or system contributions that make the work stand out.

Joint Speech Recognition
Structural Learning
Spoken Language Understanding
Jiliang Hu
Key Laboratory of Aerospace Information Security and Trusted Computing, Ministry of Education, School of Cyber Science and Engineering, Wuhan University, Wuhan, China
Zuchao Li
Wuhan University
Natural Language Processing, Machine Learning
Mengjia Shen
Lanzhou University
Graph Mining
Haojun Ai
Key Laboratory of Aerospace Information Security and Trusted Computing, Ministry of Education, School of Cyber Science and Engineering, Wuhan University, Wuhan, China
Sheng Li
National Institute of Information and Communications Technology, Japan
Jun Zhang
Wuhan Second Ship Design and Research Institute, Wuhan, China