The Speech-LLM Takes It All: A Truly Fully End-to-End Spoken Dialogue State Tracking Approach

📅 2025-10-10
🤖 AI Summary
This work addresses context modeling in end-to-end spoken dialogue state tracking (DST), investigating how different speech-context representations affect performance. We propose and, for the first time, empirically validate that feeding the full speech history (rather than text-only history or current-turn speech alone) into a Speech-LLM significantly improves tracking accuracy. To mitigate the computational overhead of long speech inputs, we further introduce an attention-based pooling mechanism that compresses the speech history while preserving discriminative information. Systematic evaluation on SpokenWOZ shows that the full-speech-input variant achieves state-of-the-art accuracy among models of comparable scale, while the compressed variant offers the best trade-off between accuracy and inference efficiency. Our core contributions are: (i) establishing full speech history as an effective, empirically validated paradigm for spoken DST context modeling; and (ii) providing a lightweight, computationally efficient pathway for practical deployment.

📝 Abstract
This paper presents a comparative study of context management strategies for end-to-end Spoken Dialog State Tracking using Speech-LLMs. We systematically evaluate traditional multimodal context (combining text history and spoken current turn), full spoken history, and compressed spoken history approaches. Our experiments on the SpokenWOZ corpus demonstrate that providing the full spoken conversation as input yields the highest performance among models of similar size, significantly surpassing prior methods. Furthermore, we show that attention-pooling-based compression of the spoken history offers a strong trade-off, maintaining competitive accuracy with reduced context size. Detailed analysis confirms that improvements stem from more effective context utilization.
Problem

Research questions and friction points this paper is trying to address.

Evaluating context management strategies for spoken dialogue state tracking
Comparing multimodal versus full spoken history input approaches
Optimizing spoken context compression for efficient dialogue processing
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses full spoken conversation as input
Compresses spoken history with attention-pooling
Enhances context utilization for tracking
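The paper does not include its implementation here, so the exact formulation of the attention-pooling compression is an assumption. A minimal sketch of the general idea, where a small set of learned query vectors cross-attend over the frame-level speech embeddings to produce a fixed-size summary of the spoken history, might look like:

```python
import numpy as np

def attention_pool(frames: np.ndarray, queries: np.ndarray) -> np.ndarray:
    """Compress T frame embeddings (T, d) into k summary vectors (k, d)
    via scaled dot-product attention with learned query vectors.

    Hypothetical illustration; the paper's actual pooling layer may differ.
    """
    d = frames.shape[-1]
    scores = queries @ frames.T / np.sqrt(d)       # (k, T) attention logits
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True) # softmax over the T frames
    return weights @ frames                        # (k, d) pooled summary

# Toy usage: 500 speech frames compressed to 16 context vectors.
rng = np.random.default_rng(0)
T, d, k = 500, 64, 16
frames = rng.standard_normal((T, d))
queries = rng.standard_normal((k, d))  # learned parameters in practice
pooled = attention_pool(frames, queries)
print(pooled.shape)  # (16, 64)
```

The compressed vectors would then replace the raw spoken history in the Speech-LLM's input, shrinking the context from hundreds of frames per turn to a small constant number of vectors.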