AI Summary
This work addresses the lack of comprehensive evaluation benchmarks for assessing the tool-use capabilities of speech language models (SpeechLMs), particularly with respect to domain coverage, acoustic diversity, and compositional reasoning. To bridge this gap, we introduce Audio2Tool, the first holistic benchmark specifically designed to evaluate SpeechLMs' tool-invocation proficiency across three real-world domains: smart automotive, home, and wearable devices. Audio2Tool comprises approximately 30,000 spoken queries of varying complexity, synthesized using zero-shot voice cloning, diverse noise injection, and high-fidelity audio generation to emulate realistic acoustic conditions. The benchmark incorporates challenging tasks such as multi-intent understanding and information extraction. Experimental results reveal that while current SpeechLMs perform adequately on simple commands, their performance degrades significantly under compositional semantics and noisy conditions, highlighting critical limitations in robustness and generalization.
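To make the noise-injection idea concrete, here is a minimal sketch of mixing a noise clip into a speech clip at a chosen signal-to-noise ratio (SNR). It is an illustration only and not the Audio2Tool pipeline: the helper names `rms` and `mix_at_snr`, the representation of audio as plain Python lists of float samples, and the example SNR are all assumptions.

```python
import math


def rms(samples):
    """Root-mean-square amplitude of a list of float samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def mix_at_snr(speech, noise, snr_db):
    """Add noise to speech so that the speech-to-noise ratio equals snr_db.

    Hypothetical helper: assumes both clips are equal-length lists of floats
    and that the noise clip is not silent.
    """
    if len(speech) != len(noise):
        raise ValueError("speech and noise must have the same length")
    noise_rms = rms(noise)
    if noise_rms == 0:
        raise ValueError("noise clip is silent")
    # SNR_dB = 20 * log10(rms_speech / rms_noise), solved for the noise level.
    target_noise_rms = rms(speech) / (10 ** (snr_db / 20))
    gain = target_noise_rms / noise_rms
    return [s + gain * n for s, n in zip(speech, noise)]
```

Sweeping `snr_db` from high to low values is one simple way to see how recognition and tool-calling accuracy change as conditions get noisier.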
Abstract
Voice assistants increasingly rely on Speech Language Models (SpeechLMs) to interpret spoken queries and execute complex tasks, yet existing benchmarks lack the domain breadth, acoustic diversity, and compositional reasoning complexity needed to evaluate tool-calling performance. We introduce Audio2Tool, a large-scale dataset comprising approximately 30,000 queries designed to assess the tool-calling capabilities of SpeechLMs across three primary domains: Smart Car, Smart Home, and Wearables. Our benchmark features a multi-tier complexity hierarchy, ranging from simple direct commands to complex multi-intent queries and needle-in-a-haystack extraction, so that distinct failure modes can be isolated. To ensure realism, we employ zero-shot voice-cloning text-to-speech synthesis and diverse noise profiles to simulate in-the-wild conditions. Evaluations of state-of-the-art SpeechLMs and ASR-LLM pipelines show strong performance on simple commands but significant degradation under compositional and acoustic challenges. We will release the dataset and benchmark upon acceptance.
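As an illustration of what scoring a tool call can involve, the sketch below compares a predicted set of calls against a reference, ignoring call order and argument key order so that multi-intent queries are not penalized for ordering alone. This is an assumed scoring scheme, not the metric defined by Audio2Tool: the JSON shape `{"name": ..., "arguments": {...}}`, the helper names `canonical_calls` and `calls_match`, and the tool names in the usage example are hypothetical.

```python
import json


def canonical_calls(text):
    """Parse model output into a sorted list of canonical call strings.

    Accepts either one JSON object or a JSON list of objects (assumed format).
    Returns None when the text is not valid JSON or a call has no "name".
    """
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return None
    calls = data if isinstance(data, list) else [data]
    canonical = []
    for call in calls:
        if not isinstance(call, dict) or "name" not in call:
            return None
        normalized = {"name": call["name"], "arguments": call.get("arguments", {})}
        # sort_keys makes the comparison independent of argument key order.
        canonical.append(json.dumps(normalized, sort_keys=True))
    return sorted(canonical)


def calls_match(predicted, reference):
    """True when both outputs contain exactly the same calls, in any order."""
    pred = canonical_calls(predicted)
    return pred is not None and pred == canonical_calls(reference)
```

For example, a two-intent Smart Car query would count as correct only if both calls are present with matching arguments:

```python
reference = '[{"name": "set_temperature", "arguments": {"zone": "driver", "celsius": 21}}, {"name": "play_music", "arguments": {"genre": "jazz"}}]'
predicted = '[{"name": "play_music", "arguments": {"genre": "jazz"}}, {"name": "set_temperature", "arguments": {"celsius": 21, "zone": "driver"}}]'
print(calls_match(predicted, reference))
```

Exact matching is deliberately strict; a benchmark might also report partial credit per call or per argument, which this sketch does not attempt.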