Mind the Goal: Data-Efficient Goal-Oriented Evaluation of Conversational Agents and Chatbots using Teacher Models

📅 2025-10-04
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Current dialogue evaluation predominantly operates at the turn level, failing to assess whether users’ overarching goals—such as policy inquiry or leave application—are successfully fulfilled. To address this, we propose a goal-centric evaluation framework: first identifying the user’s core intent, then modeling cross-turn coherence via a teacher large language model to precisely determine task completion. We introduce two novel components: (1) Goal Success Rate (GSR), a quantitative metric for end-to-end goal achievement; and (2) Root-Cause-of-Failure (RCOF), a taxonomy for classifying failure modes, augmented by chain-of-thought reasoning to generate interpretable, attributable evaluation traces. The framework integrates domain-expert-defined goal criteria, enabling data-efficient, fine-grained, and fully automated assessment. Deployed in the enterprise employee assistant system AIDA, our framework increased GSR from 63% to 79%, demonstrating its effectiveness and practicality in driving real-world system optimization.

📝 Abstract
Evaluating the quality of multi-turn chatbot interactions remains challenging, as most existing methods assess interactions at the turn level without addressing whether a user's overarching goal was fulfilled. A "goal" here refers to an information need or task, such as asking for policy information or applying for leave. We propose a comprehensive framework for goal-oriented evaluation of multi-agent systems (MAS), introducing the Goal Success Rate (GSR) to measure the percentage of fulfilled goals, and a Root Cause of Failure (RCOF) taxonomy to identify reasons for failure in multi-agent chatbots. Our method segments conversations by user goals and evaluates success using all relevant turns. We present a model-based evaluation system built on teacher LLMs, in which domain experts define goals and set quality standards that guide the LLMs. The LLMs use "thinking tokens" to produce interpretable rationales, enabling explainable, data-efficient evaluations. In an enterprise setting, we apply our framework to evaluate AIDA, a zero-to-one employee assistant built from the ground up as a multi-agent conversational system, and observe a GSR improvement from 63% to 79% over the six months since its inception. Our framework is generic and, through a detailed defect taxonomy based on analysis of failure points in multi-agent chatbots, offers actionable insights: diagnosing overall success, identifying key failure modes, and informing system improvements.
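The GSR computation described in the abstract (segment conversations by user goal, judge each segment against all relevant turns, then report the fraction of fulfilled goals) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `GoalSegment` structure and the example data are assumptions, and the `fulfilled` flag stands in for the teacher-LLM judgment.

```python
from dataclasses import dataclass

@dataclass
class GoalSegment:
    """One user goal within a conversation, spanning one or more turns."""
    goal: str          # e.g. "apply for leave"
    turns: list[str]   # all turns relevant to this goal
    fulfilled: bool    # stand-in for the teacher-LLM success judgment

def goal_success_rate(segments: list[GoalSegment]) -> float:
    """GSR = fulfilled goals / total goals, per the abstract's definition."""
    if not segments:
        return 0.0
    return sum(s.fulfilled for s in segments) / len(segments)

# Hypothetical goal segments for illustration only.
segments = [
    GoalSegment("policy inquiry", ["What is the PTO policy?", "..."], True),
    GoalSegment("apply for leave", ["File leave for Friday.", "..."], False),
    GoalSegment("policy inquiry", ["What are the remote work rules?", "..."], True),
]
print(f"GSR: {goal_success_rate(segments):.0%}")  # GSR: 67%
```

Note that the unit of evaluation is the goal segment, not the turn, which is what distinguishes GSR from turn-level metrics.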
Problem

Research questions and friction points this paper is trying to address.

Evaluating chatbot goal fulfillment beyond turn-level interactions
Measuring goal success rate and identifying failure causes systematically
Providing explainable data-efficient evaluation using teacher LLMs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Goal-oriented evaluation using teacher LLMs
Goal Success Rate and Root Cause taxonomy
Thinking tokens enable explainable rationales
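The RCOF component pairs a failure taxonomy with chain-of-thought reasoning so each failed goal gets an attributable category and rationale. A rough sketch of how such a teacher-LLM prompt might be assembled is below; the category names are hypothetical placeholders, since this page does not list the paper's actual taxonomy entries, and `build_rcof_prompt` is an illustrative helper, not an API from the paper.

```python
# Hypothetical failure categories for illustration; the paper's actual
# RCOF taxonomy entries are not reproduced on this page.
RCOF_CATEGORIES = [
    "intent_misclassification",
    "wrong_agent_routing",
    "incomplete_answer",
    "tool_failure",
    "hallucination",
]

def build_rcof_prompt(goal: str, turns: list[str]) -> str:
    """Ask a teacher LLM to reason step by step, then name one failure category."""
    transcript = "\n".join(turns)
    return (
        f"User goal: {goal}\n"
        f"Conversation:\n{transcript}\n\n"
        "Think step by step about why the goal was not fulfilled, then answer "
        f"with exactly one category from {RCOF_CATEGORIES} and a one-sentence "
        "rationale."
    )
```

The step-by-step instruction corresponds to the "thinking tokens" mentioned above: the model's intermediate reasoning becomes the interpretable evaluation trace.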
Deepak Babu Piskala
Amazon.com, Seattle, WA, USA
Sharlene Chen
Amazon.com, Seattle, WA, USA
Udita Patel
Amazon.com
Parul Kalra
Amazon.com, Seattle, WA, USA
Rafael Castrillo
Amazon.com, Seattle, WA, USA