🤖 AI Summary
Existing SLU datasets suffer from limited scenario diversity, simplistic intent structures, and the absence of a unified, large-model-oriented evaluation benchmark. To address these limitations, we introduce MAC-SLU, the first multi-intent spoken language understanding dataset specifically designed for in-vehicle cabin environments. It covers realistic complex interactions, context dependency, and concurrent multi-intent utterances. MAC-SLU supports both end-to-end and pipeline-based SLU paradigms and establishes the first unified, fine-grained evaluation benchmark for large language models (LLMs) and large audio-language models (LALMs) in the in-vehicle domain. Experimental results show that supervised fine-tuning substantially outperforms in-context learning; moreover, end-to-end LALMs achieve performance comparable to pipeline methods while avoiding ASR error propagation. This work fills a gap in multi-intent in-vehicle SLU benchmarking and supports the rigorous evaluation and practical deployment of foundation models in real-world speech understanding tasks.
📝 Abstract
Spoken Language Understanding (SLU), which aims to extract user semantics to execute downstream tasks, is a crucial component of task-oriented dialog systems. Existing SLU datasets generally lack sufficient diversity and complexity, and there is no unified benchmark for the latest Large Language Models (LLMs) and Large Audio Language Models (LALMs). This work introduces MAC-SLU, a novel Multi-Intent Automotive Cabin Spoken Language Understanding Dataset, which increases the difficulty of the SLU task by incorporating authentic and complex multi-intent data. Based on MAC-SLU, we conducted a comprehensive benchmark of leading open-source LLMs and LALMs, covering in-context learning and supervised fine-tuning (SFT), as well as end-to-end (E2E) and pipeline paradigms. Our experiments show that while LLMs and LALMs have the potential to complete SLU tasks through in-context learning, their performance still lags significantly behind SFT. Meanwhile, E2E LALMs demonstrate performance comparable to pipeline approaches and effectively avoid error propagation from speech recognition. Code (https://github.com/Gatsby-web/MAC_SLU) and datasets (huggingface.co/datasets/Gatsby1984/MAC_SLU) are released publicly.
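To make the multi-intent setting concrete, the sketch below shows one way a single utterance could map to several intent frames and how a prediction might be scored against a reference. The intent names, slot names, example values, and the frame-level exact-match and F1 scoring are illustrative assumptions; they are not taken from the MAC-SLU schema or from the paper's evaluation protocol.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IntentFrame:
    """One intent with its slots (hypothetical structure, not the MAC-SLU schema)."""
    intent: str
    slots: frozenset  # frozenset of (slot_name, slot_value) pairs


def frame(intent, **slots):
    """Build a hashable frame from keyword slots."""
    return IntentFrame(intent, frozenset(slots.items()))


def score(predicted, reference):
    """Frame-level scoring: a frame counts only if intent and all slots match."""
    pred, ref = set(predicted), set(reference)
    matched = len(pred & ref)
    precision = matched / len(pred) if pred else 0.0
    recall = matched / len(ref) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if matched else 0.0
    return {
        "exact_match": pred == ref,  # every frame in the utterance is correct
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical utterance: "Set the driver side to 22 degrees and open the rear left window."
reference = [
    frame("set_temperature", zone="driver", value="22"),
    frame("open_window", position="rear_left"),
]
# A prediction that gets the first intent right but the window position wrong.
predicted = [
    frame("set_temperature", zone="driver", value="22"),
    frame("open_window", position="rear_right"),
]

result = score(predicted, reference)
print(result)
```

Treating each frame as an unordered set element means the order in which intents are emitted does not affect the score, which seems a reasonable choice when several intents occur in one utterance; a stricter or more lenient metric (for example, slot-level partial credit) would need a different comparison.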