AI Summary
Existing benchmarks for voice-based agents primarily focus on passive responsiveness, failing to adequately assess their capacity for proactive intervention and monitoring. To address this gap, this work proposes ProVoice-Bench, the first evaluation framework specifically designed for proactive voice agents. The framework introduces four novel task categories that simulate active interaction scenarios and employs a multi-stage data synthesis pipeline to construct a high-quality benchmark dataset comprising 1,182 samples. Systematic evaluation of state-of-the-art multimodal large language models using this framework reveals significant deficiencies in current models' ability to trigger actions appropriately and reason contextually, particularly manifesting as excessive triggering and logical inconsistencies. These findings highlight critical challenges in developing robust proactive voice interaction capabilities.
Abstract
Recent advancements in LLM agents are gradually shifting from reactive, text-based paradigms toward proactive, multimodal interaction. However, existing benchmarks primarily focus on reactive responses, overlooking the complexities of proactive intervention and monitoring. To bridge this gap, we introduce ProVoice-Bench, the first evaluation framework specifically designed for proactive voice agents, featuring four novel tasks. By leveraging a multi-stage data synthesis pipeline, we curate 1,182 high-quality samples for rigorous testing. Our evaluation of state-of-the-art Multimodal LLMs reveals a significant performance gap, particularly regarding over-triggering and reasoning capabilities. These findings highlight the limitations of current models and offer a roadmap for developing more natural, context-aware proactive agents.