🤖 AI Summary
Current dialogue evaluation predominantly operates at the turn level, failing to assess whether users' overarching goals, such as policy inquiry or leave application, are successfully fulfilled. To address this, we propose a goal-centric evaluation framework: first identifying the user's core intent, then modeling cross-turn coherence via a teacher large language model to precisely determine task completion. We introduce two novel components: (1) Goal Success Rate (GSR), a quantitative metric for end-to-end goal achievement; and (2) Root Cause of Failure (RCOF), a taxonomy for classifying failure modes, augmented by chain-of-thought reasoning to generate interpretable, attributable evaluation traces. The framework integrates domain-expert-defined goal criteria, enabling data-efficient, fine-grained, and fully automated assessment. Deployed in the enterprise employee assistant system AIDA, our framework increased GSR from 63% to 79%, demonstrating its effectiveness and practicality in driving real-world system optimization.
📝 Abstract
Evaluating the quality of multi-turn chatbot interactions remains challenging, as most existing methods assess interactions at the turn level without addressing whether a user's overarching goal was fulfilled. A "goal" here refers to an information need or task, such as asking for policy information or applying for leave. We propose a comprehensive framework for goal-oriented evaluation of multi-agent systems (MAS), introducing the Goal Success Rate (GSR) to measure the percentage of fulfilled goals, and a Root Cause of Failure (RCOF) taxonomy to identify reasons for failure in multi-agent chatbots. Our method segments conversations by user goals and evaluates success using all relevant turns. We present a model-based evaluation system built on teacher LLMs, in which domain experts define goals and set quality standards that serve as guidance for the LLMs. The LLMs use "thinking tokens" to produce interpretable rationales, enabling explainable, data-efficient evaluations. In an enterprise setting, we apply our framework to evaluate AIDA, an employee assistant built from the ground up as a multi-agent conversational system, and observe a GSR improvement from 63% to 79% over the six months since its inception. Our framework is generic and offers actionable insights through a detailed defect taxonomy derived from analysis of failure points in multi-agent chatbots: it diagnoses overall success, identifies key failure modes, and informs system improvements.
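To make the metric concrete, here is a minimal sketch of how GSR could be computed once conversations have been segmented into goals and each goal has received a fulfilled/unfulfilled verdict (e.g., from a teacher-LLM judge). The `Goal` dataclass, its field names, and the `failure_cause` RCOF label are illustrative assumptions, not the paper's actual implementation.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Goal:
    """One user goal segmented out of a multi-turn conversation."""
    goal_id: str
    turns: list = field(default_factory=list)   # all turns relevant to this goal
    fulfilled: bool = False                     # verdict from a teacher-LLM judge
    failure_cause: Optional[str] = None         # hypothetical RCOF taxonomy label

def goal_success_rate(goals: list) -> float:
    """GSR = fulfilled goals / total goals, expressed as a percentage."""
    if not goals:
        return 0.0
    return 100.0 * sum(g.fulfilled for g in goals) / len(goals)

# Toy example: three segmented goals, two of which were fulfilled.
goals = [
    Goal("g1", turns=["ask leave policy", "answer"], fulfilled=True),
    Goal("g2", turns=["apply for leave", "fill form", "confirm"], fulfilled=True),
    Goal("g3", turns=["reset password"], fulfilled=False,
         failure_cause="tool_invocation_error"),  # hypothetical RCOF label
]
print(round(goal_success_rate(goals), 1))  # prints 66.7
```

Aggregating at the goal level rather than the turn level is the key design choice: a goal spanning five turns counts once, so a long but ultimately successful interaction is not penalized turn by turn.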