🤖 AI Summary
This work addresses the lack of end-to-end open-source solutions for spoken dialogue state tracking (DST). We propose the first fully open-source, end-to-end spoken DST framework built exclusively on open models and data. Methodologically, we design a lightweight representation alignment module to bridge the latent spaces of the WavLM-large speech encoder and an OLMo-1B or Gemma-2-9B-instruct language model; incorporate LoRA-based parameter-efficient fine-tuning and turn-level dialogue modeling; and introduce a fuzzy string matching post-processing step that improves robustness on named-entity slot values. On the SpokenWOZ test set, the OLMo-1B system achieves a state-of-the-art joint goal accuracy (JGA) of 34.66%, and the Gemma-2-9B-instruct system raises this by a further 7.51 percentage points to 42.17%. Key contributions include: (1) the first end-to-end open-source spoken DST system; (2) a cross-modal speech–text representation alignment mechanism; and (3) a lightweight adaptation and post-processing paradigm tailored for spoken DST.
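The alignment module described above connects speech-encoder frames to the LLM's input embedding space. A minimal sketch of such a connector, assuming a simple frame-stacking downsampler followed by an MLP projection (the class name, dimensions, and stacking factor are illustrative, not the paper's exact design):

```python
import torch
import torch.nn as nn

class SpeechTextConnector(nn.Module):
    """Hypothetical connector: maps speech-encoder frames into the LLM
    embedding space, downsampling in time by stacking adjacent frames.
    Dimensions are illustrative (WavLM-large hidden size 1024; an
    assumed LLM embedding size of 2048)."""

    def __init__(self, speech_dim=1024, llm_dim=2048, stack=4):
        super().__init__()
        self.stack = stack
        self.proj = nn.Sequential(
            nn.Linear(speech_dim * stack, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, frames):
        # frames: (batch, T, speech_dim)
        b, t, d = frames.shape
        t = t - t % self.stack  # drop trailing frames that don't fill a stack
        stacked = frames[:, :t].reshape(b, t // self.stack, d * self.stack)
        return self.proj(stacked)  # (batch, T // stack, llm_dim)

# Example: 100 WavLM frames become 25 embeddings in the LLM space
out = SpeechTextConnector()(torch.randn(2, 100, 1024))
```

During alignment training, only a small module like this (plus any LoRA adapters) would be updated, keeping the speech encoder and LLM frozen or lightly tuned.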
📝 Abstract
In this work, we approach spoken Dialogue State Tracking (DST) by bridging the representation spaces of speech encoders and LLMs via a small connector module, focusing on fully open-source, open-data components (WavLM-large, OLMo). We ablate several aspects of such systems, including full versus LoRA adapter fine-tuning, the effect of including agent turns in the dialogue history, and fuzzy matching-based output post-processing, which greatly improves our systems' performance on named entities in the dialogue slot values. We conduct our experiments on the SpokenWOZ dataset and additionally use the Speech-Aware MultiWOZ dataset to augment our training data. Ultimately, our best-performing WavLM + connector + OLMo-1B aligned model achieves state of the art on the SpokenWOZ test set (34.66% JGA), and our system with Gemma-2-9B-instruct further surpasses this result, reaching 42.17% JGA on SpokenWOZ test.
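The fuzzy-matching post-processing step snaps model outputs onto known slot values, which helps when ASR or generation mangles a named entity. A minimal sketch using the standard library's `difflib` (the function name, candidate list, and 0.8 cutoff are illustrative assumptions, not the paper's exact matcher or threshold):

```python
import difflib

def snap_slot_value(predicted, candidates, cutoff=0.8):
    """Fuzzy-match a predicted slot value against known candidate values
    (e.g. entity names from the task database). Returns the closest
    candidate scoring above `cutoff`, otherwise the prediction unchanged."""
    matches = difflib.get_close_matches(predicted, candidates, n=1, cutoff=cutoff)
    return matches[0] if matches else predicted

# A misspaced entity snaps to the canonical database entry;
# an unrelated value passes through untouched.
names = ["pizza hut fen ditton", "the gardenia", "curry prince"]
fixed = snap_slot_value("pizza hut fenditton", names)
untouched = snap_slot_value("saturday", names)
```

A design note: applying the match only above a similarity cutoff matters, because snapping every prediction to the nearest candidate would corrupt slot values (times, dates, counts) that legitimately fall outside the entity list.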