🤖 AI Summary
This work addresses context modeling in end-to-end spoken dialogue state tracking (DST), investigating how different speech-context representations affect performance. We show, for the first time, that feeding the full speech history into a Speech-LLM, rather than text-only history or current-turn speech alone, significantly improves tracking accuracy. To mitigate the resulting computational overhead, we further introduce an attention-based pooling mechanism that compresses the speech history while preserving discriminative information. Systematic evaluation on SpokenWOZ demonstrates that the full-speech-input variant achieves state-of-the-art accuracy among models of comparable scale, while the compressed variant attains the best trade-off between accuracy and inference efficiency. Our core contributions are: (i) establishing full speech history as an effective, empirically validated paradigm for spoken DST context modeling; and (ii) providing a lightweight, computationally efficient path to practical deployment.
📝 Abstract
This paper presents a comparative study of context management strategies for end-to-end spoken dialogue state tracking using Speech-LLMs. We systematically evaluate the traditional multimodal context (text history combined with the spoken current turn), the full spoken history, and a compressed spoken history. Our experiments on the SpokenWOZ corpus demonstrate that providing the full spoken conversation as input yields the highest performance among models of similar size, significantly surpassing prior methods. Furthermore, we show that attention-pooling-based compression of the spoken history offers a strong trade-off, maintaining competitive accuracy at a reduced context size. Detailed analysis confirms that the improvements stem from more effective context utilization.
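The paper does not spell out the pooling mechanism here, but one common form of attention pooling uses a small set of learned query vectors that attend over all frame embeddings, compressing a long speech history into a fixed number of summary vectors. The sketch below illustrates this idea in NumPy; the shapes, names, and random initialization are illustrative assumptions, not the authors' implementation (in a real model the queries would be trainable parameters and the attention would run inside the network).

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_pool(frames, queries):
    """Compress T frame embeddings (T, d) into K summary vectors (K, d).

    Each of the K queries computes scaled dot-product attention weights
    over all T frames, then takes the weighted average of the frames.
    """
    d = frames.shape[-1]
    scores = queries @ frames.T / np.sqrt(d)   # (K, T) similarity scores
    weights = softmax(scores, axis=-1)         # each row sums to 1
    return weights @ frames                    # (K, d) pooled summaries

# Illustrative sizes: 500 speech frames compressed to 16 vectors.
rng = np.random.default_rng(0)
T, K, d = 500, 16, 64
speech_history = rng.standard_normal((T, d))
learned_queries = rng.standard_normal((K, d))  # would be trained end-to-end
pooled = attention_pool(speech_history, learned_queries)
print(pooled.shape)  # (16, 64)
```

The appeal of this design is that the compressed context size K is fixed regardless of dialogue length T, so the cost of feeding history to the LLM stops growing with the conversation.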