Approaching Dialogue State Tracking via Aligning Speech Encoders and LLMs

📅 2025-06-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the lack of end-to-end open-source solutions for spoken dialogue state tracking (DST). We propose the first fully open-source, end-to-end spoken DST framework built exclusively on open models and data. Methodologically, we design a lightweight representation alignment module to bridge the latent spaces of the WavLM-large speech encoder and the OLMo-1B or Gemma-2-9B-instruct language models; incorporate LoRA-based parameter-efficient fine-tuning and turn-level dialogue modeling; and introduce a fuzzy string matching post-processing step that improves robustness on named-entity slot values. On the SpokenWOZ test set, our framework achieves a joint goal accuracy (JGA) of 42.17%, establishing a new state of the art and improving on the prior best by 7.51 percentage points. Key contributions include: (1) the first end-to-end open-source spoken DST system; (2) a cross-modal speech–text representation alignment mechanism; and (3) a lightweight adaptation and post-processing paradigm tailored for spoken DST.
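The alignment module described above can be pictured as a small connector network that maps speech-encoder frames into the LLM's embedding space. The sketch below is a minimal PyTorch illustration; the frame-stacking factor, layer sizes, and two-layer MLP design are assumptions for clarity, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class SpeechConnector(nn.Module):
    """Hypothetical lightweight connector: stacks consecutive speech-encoder
    frames (reducing sequence length) and projects them into the LLM
    embedding space. Dimensions are illustrative: WavLM-large outputs
    1024-dim frames; llm_dim stands in for the LLM's embedding size."""

    def __init__(self, speech_dim: int = 1024, llm_dim: int = 2048, stack: int = 4):
        super().__init__()
        self.stack = stack
        self.proj = nn.Sequential(
            nn.Linear(speech_dim * stack, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, frames, speech_dim)
        b, t, d = feats.shape
        t = t - (t % self.stack)  # drop remainder frames so stacking divides evenly
        stacked = feats[:, :t].reshape(b, t // self.stack, d * self.stack)
        return self.proj(stacked)  # (batch, frames // stack, llm_dim)

conn = SpeechConnector()
out = conn(torch.randn(2, 50, 1024))
print(out.shape)  # torch.Size([2, 12, 2048])
```

The projected sequence can then be interleaved with text-token embeddings of the dialogue history before being fed to the (optionally LoRA-adapted) LLM.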

📝 Abstract
In this work, we approach spoken Dialogue State Tracking (DST) by bridging the representation spaces of speech encoders and LLMs via a small connector module, with a focus on fully open-sourced and open-data components (WavLM-large, OLMo). We focus on ablating different aspects of such systems including full/LoRA adapter fine-tuning, the effect of agent turns in the dialogue history, as well as fuzzy matching-based output post-processing, which greatly improves performance of our systems on named entities in the dialogue slot values. We conduct our experiments on the SpokenWOZ dataset, and additionally utilize the Speech-Aware MultiWOZ dataset to augment our training data. Ultimately, our best-performing WavLM + connector + OLMo-1B aligned models achieve state of the art on the SpokenWOZ test set (34.66% JGA), and our system with Gemma-2-9B-instruct further surpasses this result, reaching 42.17% JGA on SpokenWOZ test.
Problem

Research questions and friction points this paper is trying to address.

Bridging speech encoders and LLMs for spoken Dialogue State Tracking
Improving robustness on named entities in dialogue slot values
Achieving state-of-the-art performance on the SpokenWOZ dataset
Innovation

Methods, ideas, or system contributions that make the work stand out.

Bridges speech encoders and LLMs via connector module
Uses open-sourced WavLM-large and OLMo components
Incorporates fuzzy matching for output post-processing
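The fuzzy-matching post-processing step can be illustrated with Python's standard difflib: a predicted slot value is snapped to the closest known ontology value when it is similar enough. The similarity cutoff and the ontology-snapping strategy here are assumptions for illustration, not the paper's exact method.

```python
import difflib

def snap_to_ontology(predicted: str, candidates: list[str], cutoff: float = 0.8) -> str:
    """Snap an LLM-predicted slot value to the closest ontology value.
    If no candidate is similar enough, keep the raw prediction.
    (Illustrative sketch; threshold and matcher are assumptions.)"""
    matches = difflib.get_close_matches(
        predicted.lower(), [c.lower() for c in candidates], n=1, cutoff=cutoff
    )
    if not matches:
        return predicted
    # Recover the original casing of the matched candidate.
    for c in candidates:
        if c.lower() == matches[0]:
            return c
    return predicted

# Example: minor ASR/LLM noise in a named entity is corrected,
# while an unrelated prediction passes through unchanged.
restaurants = ["Golden Wok", "Pizza Hut Cherry Hinton", "The Gardenia"]
print(snap_to_ontology("golden wok.", restaurants))  # Golden Wok
print(snap_to_ontology("sushi place", restaurants))  # sushi place
```

This kind of snapping is what makes the system robust to small transcription errors in named-entity slot values such as restaurant or hotel names.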
Simon Sedláček
Speech@FIT, Brno University of Technology, Czechia
Bolaji Yusuf
Researcher, Brno University of Technology
Speech recognition, Spoken term detection
Ján Švec
Speech@FIT, Brno University of Technology, Czechia
Pradyoth Hegde
Indian Institute of Information Technology Dharwad
Speech processing
Santosh Kesiraju
Brno University of Technology
Speech and language processing, Machine learning
Oldřich Plchot
Speech@FIT, Brno University of Technology, Czechia
Jan "Honza" Černocký
Speech@FIT, Brno University of Technology, Czechia