Joint Speech and Text Training for LLM-Based End-to-End Spoken Dialogue State Tracking

📅 2025-11-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address weak cross-domain generalization and the reliance on target-domain speech annotations in end-to-end spoken dialogue state tracking (DST), this paper proposes a multimodal joint training paradigm that requires no annotated target-domain speech. Methodologically, it integrates a speech foundation encoder with a large language model and jointly optimizes on spoken DST data and multi-domain textual DST data, enabling collaborative learning of speech and text representations within a unified framework. The key contribution is demonstrating cross-domain spoken DST with zero target-domain speech annotations, alleviating the critical bottleneck of speech data scarcity. Experimental results show that the method significantly outperforms strong baselines in cross-domain settings and achieves state-of-the-art performance even without any target-domain speech training data.

📝 Abstract
End-to-end spoken dialogue state tracking (DST) is made difficult by the tandem of having to handle speech input and data scarcity. Combining speech foundation encoders and large language models has been proposed in recent work to alleviate some of this difficulty. Although this approach has been shown to result in strong spoken DST models, achieving state-of-the-art performance in realistic multi-turn DST, it struggles to generalize across domains and requires annotated spoken DST training data for each domain of interest. However, collecting such data for every target domain is both costly and difficult. Noting that textual DST data is more easily obtained for various domains, in this work, we propose jointly training on available spoken DST data and written textual data from other domains as a way to achieve cross-domain generalization. We conduct experiments which show the efficacy of our proposed method for getting good cross-domain DST performance without relying on spoken training data from the target domains.
Problem

Research questions and friction points this paper is trying to address.

Improves cross-domain generalization for spoken dialogue state tracking
Reduces reliance on costly annotated spoken training data
Leverages joint training with textual data from other domains
Innovation

Methods, ideas, or system contributions that make the work stand out.

Joint training on speech and text data
Combining speech encoders with large language models
Cross-domain generalization without target spoken data
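The joint training idea above can be sketched as a batch-mixing loop in which spoken-DST and textual-DST batches update the same model. The sketch below is a minimal illustration of that scheduling step only; the function name, the `text_ratio` parameter, and the sampling policy are illustrative assumptions, not details from the paper:

```python
import random

def mixed_schedule(speech_batches, text_batches, text_ratio=0.5, seed=0):
    """Interleave spoken-DST and textual-DST batches for joint training.

    Each step draws a textual batch with probability `text_ratio`,
    otherwise a spoken batch. In joint training, both kinds of batches
    would update the same encoder-plus-LLM model, so speech and text
    representations are learned within one shared framework.
    """
    rng = random.Random(seed)
    schedule = []
    s_iter, t_iter = iter(speech_batches), iter(text_batches)
    s_left, t_left = len(speech_batches), len(text_batches)
    while s_left or t_left:
        # Fall back to the other modality when one side is exhausted.
        use_text = t_left and (not s_left or rng.random() < text_ratio)
        if use_text:
            schedule.append(("text", next(t_iter)))
            t_left -= 1
        else:
            schedule.append(("speech", next(s_iter)))
            s_left -= 1
    return schedule
```

Because textual DST data from other domains is cheaper to collect, `text_ratio` would in practice control how strongly the multi-domain text data regularizes the spoken-DST model.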
Katia Vendrame
Speech@FIT, Brno University of Technology, Czechia
Bolaji Yusuf
Researcher, Brno University of Technology
Speech recognition, spoken term detection
Santosh Kesiraju
Brno University of Technology
Speech and language processing, machine learning
Šimon Sedláček
Speech@FIT, Brno University of Technology, Czechia
Oldřich Plchot
Researcher, Brno University of Technology
Pattern recognition, speech processing, computer networks
Jan Černocký
Speech@FIT, Brno University of Technology, Czechia