🤖 AI Summary
This work addresses the lack of end-to-end open-source solutions for spoken dialogue state tracking (DST). We propose the first fully open-source, end-to-end spoken DST framework built exclusively on open models and data. Methodologically, we design a lightweight representation alignment module to bridge the latent spaces of the WavLM-large speech encoder and an OLMo-1B or Gemma-2-9B-instruct language model; incorporate LoRA-based parameter-efficient fine-tuning and turn-level dialogue modeling; and introduce a fuzzy string matching post-processing step that improves robustness on named-entity slot values. On the SpokenWOZ test set, the OLMo-1B system achieves a state-of-the-art joint goal accuracy (JGA) of 34.66%, and the Gemma-2-9B-instruct system raises this by a further 7.51 percentage points to 42.17%. Key contributions include: (1) the first end-to-end open-source spoken DST system; (2) a cross-modal speech–text representation alignment mechanism; and (3) a lightweight adaptation and post-processing paradigm tailored for spoken DST.
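The alignment module described above connects speech-encoder frames to the LLM's input embedding space. A minimal sketch of such a connector, assuming a simple frame-stacking downsampler followed by an MLP projection (the class name, dimensions, and stacking factor are illustrative, not the paper's exact design):

```python
import torch
import torch.nn as nn

class SpeechTextConnector(nn.Module):
    """Hypothetical connector: maps speech-encoder frames into the LLM
    embedding space, downsampling in time by stacking adjacent frames.
    Dimensions are illustrative (WavLM-large hidden size 1024; an
    assumed LLM embedding size of 2048)."""

    def __init__(self, speech_dim=1024, llm_dim=2048, stack=4):
        super().__init__()
        self.stack = stack
        self.proj = nn.Sequential(
            nn.Linear(speech_dim * stack, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, frames):
        # frames: (batch, T, speech_dim)
        b, t, d = frames.shape
        t = t - t % self.stack  # drop trailing frames that don't fill a stack
        stacked = frames[:, :t].reshape(b, t // self.stack, d * self.stack)
        return self.proj(stacked)  # (batch, T // stack, llm_dim)

# Example: 100 WavLM frames become 25 embeddings in the LLM space
out = SpeechTextConnector()(torch.randn(2, 100, 1024))
```

During alignment training, only a small module like this (plus any LoRA adapters) would be updated, keeping the speech encoder and LLM frozen or lightly tuned.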
📝 Abstract
In this work, we approach spoken Dialogue State Tracking (DST) by bridging the representation spaces of speech encoders and LLMs via a small connector module, focusing on fully open-source, open-data components (WavLM-large, OLMo). We ablate several aspects of such systems, including full versus LoRA adapter fine-tuning, the effect of including agent turns in the dialogue history, and fuzzy matching-based output post-processing, which greatly improves our systems' performance on named entities in the dialogue slot values. We conduct our experiments on the SpokenWOZ dataset and additionally use the Speech-Aware MultiWOZ dataset to augment our training data. Ultimately, our best-performing WavLM + connector + OLMo-1B aligned model achieves state of the art on the SpokenWOZ test set (34.66% JGA), and our system with Gemma-2-9B-instruct further surpasses this result, reaching 42.17% JGA on SpokenWOZ test.
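The fuzzy-matching post-processing step snaps model outputs onto known slot values, which helps when ASR or generation mangles a named entity. A minimal sketch using the standard library's `difflib` (the function name, candidate list, and 0.8 cutoff are illustrative assumptions, not the paper's exact matcher or threshold):

```python
import difflib

def snap_slot_value(predicted, candidates, cutoff=0.8):
    """Fuzzy-match a predicted slot value against known candidate values
    (e.g. entity names from the task database). Returns the closest
    candidate scoring above `cutoff`, otherwise the prediction unchanged."""
    matches = difflib.get_close_matches(predicted, candidates, n=1, cutoff=cutoff)
    return matches[0] if matches else predicted

# A misspaced entity snaps to the canonical database entry;
# an unrelated value passes through untouched.
names = ["pizza hut fen ditton", "the gardenia", "curry prince"]
fixed = snap_slot_value("pizza hut fenditton", names)
untouched = snap_slot_value("saturday", names)
```

A design note: applying the match only above a similarity cutoff matters, because snapping every prediction to the nearest candidate would corrupt slot values (times, dates, counts) that legitimately fall outside the entity list.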