🤖 AI Summary
This work addresses the limited reasoning capability of speech large language models (speechLLMs) in end-to-end slot filling. It integrates chain-of-thought (CoT) reasoning in three steps: (1) constructing a reasoning dataset that decomposes slot filling into intermediate steps; (2) distinguishing regular speechLLMs, which predict slots directly, from reasoning speechLLMs, and building hybrid speechLLMs that support both modes; and (3) applying supervised fine-tuning to speechLLMs built on text LLM backbones of different types and sizes. The results show that introducing explicit intermediate reasoning steps improves slot-filling performance, but that a reasoning text LLM developed mainly for math, logic, and coding may be an inferior foundation model for a reasoning speechLLM. Hybrid speechLLMs, built on a hybrid text foundation LLM and fine-tuned to preserve both the direct and the reasoning mode, outperform models fine-tuned with only one mode.
📝 Abstract
We propose integrating reasoning into speech large language models (speechLLMs) for the end-to-end slot-filling task. Inspired by the recent development of reasoning LLMs, we use a chain-of-thought framework to decompose the slot-filling task into multiple reasoning steps, create a reasoning dataset, and apply supervised fine-tuning to a speechLLM. We distinguish between regular and reasoning speechLLMs and experiment with different types and sizes of LLMs as their text foundation models. We demonstrate that introducing intermediate reasoning steps improves performance. However, we show that a reasoning textual LLM developed mainly for the math, logic, and coding domains might be inferior as a foundation model for a reasoning speechLLM. We further show that hybrid speechLLMs, built on a hybrid text foundation LLM and fine-tuned to preserve both direct and reasoning modes of operation, perform better than those fine-tuned with only one mode of operation.
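To make the distinction between the direct and reasoning modes concrete, the following is a minimal sketch of how supervised fine-tuning targets for a hybrid model might be assembled. It is not the authors' data format: the `SlotExample` fields, the `<think>` delimiters, the `name=value` slot serialization, and the example reasoning steps are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class SlotExample:
    """One annotated utterance (hypothetical schema, not from the paper)."""
    transcript: str
    slots: Dict[str, str]                            # slot name -> slot value
    steps: List[str] = field(default_factory=list)   # intermediate reasoning steps


def format_slots(slots: Dict[str, str]) -> str:
    # Sort by slot name so the serialized answer is deterministic.
    return "; ".join(f"{name}={value}" for name, value in sorted(slots.items()))


def build_target(example: SlotExample, mode: str) -> str:
    """Return the text the model is trained to emit in the given mode."""
    answer = format_slots(example.slots)
    if mode == "direct":
        # Direct mode: the slot answer only, with no intermediate steps.
        return answer
    if mode == "reasoning":
        # Reasoning mode: numbered steps inside assumed <think> tags, then the answer.
        reasoning = "\n".join(f"{i}. {s}" for i, s in enumerate(example.steps, 1))
        return f"<think>\n{reasoning}\n</think>\n{answer}"
    raise ValueError(f"unknown mode: {mode!r}")


def build_hybrid_targets(examples: List[SlotExample]) -> List[Tuple[str, str]]:
    """Emit one direct and one reasoning target per utterance, so that
    fine-tuning sees both modes of operation."""
    return [
        (mode, build_target(ex, mode))
        for ex in examples
        for mode in ("direct", "reasoning")
    ]


example = SlotExample(
    transcript="book a flight to boston on friday",
    slots={"destination": "boston", "date": "friday"},
    steps=["Transcribe the utterance.", "Identify candidate slot values."],
)
for mode, target in build_hybrid_targets([example]):
    print(f"[{mode}]\n{target}\n")
```

Pairing every utterance with both target types is only one possible way to preserve both modes during fine-tuning; the abstract does not state how the two modes are mixed.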