🤖 AI Summary
This work addresses the vulnerability of multimodal agents to error propagation in speech-to-text cascaded pipelines and the absence of a unified end-to-end evaluation framework. To this end, we propose FOCAL—the first benchmark framework tailored for evaluating multimodal agents in spoken interaction. FOCAL features a modular architecture that integrates audio language models (ALMs), large language models (LLMs), and MCP servers, enabling comprehensive end-to-end reasoning evaluation and fine-grained error propagation tracing. The framework introduces two novel metrics—reasoning score and semantic score—to quantitatively assess spoken dialogue quality, supporting both automated and human-in-the-loop evaluation. Experimental results demonstrate that FOCAL systematically evaluates the effectiveness of multimodal agents under voice-based interaction, offering reliable insights for industrial-scale optimization.
📝 Abstract
With recent advancements in reasoning capabilities, tool calling using MCP servers, and Audio Language Models (ALMs), the development and integration of multi-modal agents (with voice and text support) have come to the industry forefront. Cascading pipelines for voice agents still play a central role in the industry owing to their superior reasoning capabilities facilitated by LLMs. However, cascading pipelines often propagate errors from one stage to the next. We propose a framework, FOCAL, to benchmark end-to-end reasoning, component-wise error propagation, and error analysis for automated as well as human-assisted testing of multi-modal agents (voice-to-voice + text input). We also share two novel metrics, viz. Reasoning and Semantic scores, to evaluate the efficacy of the agent in having meaningful conversations in voice mode.
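To make the idea of component-wise error propagation concrete, the following is a minimal sketch, not FOCAL's implementation. It assumes a two-stage cascade (speech-to-text followed by an LLM agent) and attributes a failure to the first stage whose clean input would have produced the expected result. The names `word_error_rate`, `attribute_error`, and `toy_agent` are hypothetical, and word error rate is used only as a familiar stand-in; it is not the Reasoning or Semantic score proposed in the paper.

```python
from typing import Callable


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        return 0.0 if not hyp else 1.0
    # At the start of iteration i, prev[j] is the distance between
    # ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, substitution))
        prev = curr
    return prev[-1] / len(ref)


def attribute_error(
    reference: str,
    transcript: str,
    expected: str,
    agent: Callable[[str], str],
) -> str:
    """Name the stage responsible for a wrong final answer (assumed scheme)."""
    if agent(transcript) == expected:
        return "none"
    # The agent succeeds on the clean text, so the transcript caused the failure.
    if agent(reference) == expected:
        return "speech-to-text"
    return "reasoning"


def toy_agent(text: str) -> str:
    # Hypothetical stand-in for the LLM and tool-calling stage.
    return "book_flight:paris" if "paris" in text.lower() else "unknown"


reference = "book a flight to Paris"
transcript = "book a flight to Harris"  # simulated speech-to-text mistake
wer = word_error_rate(reference, transcript)
stage = attribute_error(reference, transcript, "book_flight:paris", toy_agent)
print(f"WER={wer:.2f}, failing stage: {stage}")
```

In this toy case a single misrecognized word is enough to change the agent's final action, which is the kind of propagation a component-wise benchmark is meant to surface.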