🤖 AI Summary
Evaluating multi-turn goal-driven interactions is hard: annotations are scarce, human evaluation does not scale, existing custom metrics cannot catch previously unknown errors, and LLM-based self-assessment is unreliable.
Method: This paper proposes the first unsupervised, automated evaluation framework tailored to this setting. It extracts statistical features from interaction logs via large language models (LLMs), then jointly performs unsupervised goal clustering, distributional-shift adaptation, and uncertainty quantification to assess goal recognition accuracy, goal completion rate, and model confidence end to end, without requiring gold responses or human annotations.
Contribution/Results: The paper introduces the first unsupervised evaluation metric suite for goal-driven interactions and leverages fine-tuned LLMs to capture implicit failure patterns. Experiments on open-domain and task-oriented benchmarks demonstrate that the framework significantly improves assessment reliability and scalability, accurately detects previously unseen errors, and quantifies interaction quality with calibrated uncertainty estimates.
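The paper does not include code, so as a rough illustration of what unsupervised goal clustering over unlabeled interaction logs could look like, here is a minimal stdlib-only sketch: utterances are represented as bag-of-words vectors and greedily merged by cosine similarity. The function names, the threshold, and the bag-of-words representation are assumptions for illustration, not the paper's actual method (which uses LLM-derived features).

```python
import math
from collections import Counter

def bow(text):
    """Bag-of-words vector for an utterance (illustrative stand-in
    for the LLM-derived features the paper describes)."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def cluster_goals(utterances, threshold=0.3):
    """Greedy single-pass clustering: assign each utterance to the
    most similar existing cluster, or start a new one."""
    clusters = []  # list of (centroid_counter, member_list)
    for u in utterances:
        v = bow(u)
        best, best_sim = None, 0.0
        for c in clusters:
            s = cosine(v, c[0])
            if s > best_sim:
                best, best_sim = c, s
        if best is not None and best_sim >= threshold:
            best[0].update(v)   # fold the utterance into the centroid
            best[1].append(u)
        else:
            clusters.append((v, [u]))
    return [members for _, members in clusters]
```

For example, `cluster_goals(["cancel my order please", "please cancel the order", "track my package", "where is my package"])` groups the cancellation requests separately from the package-tracking ones.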
📝 Abstract
Large language models (LLMs) have seen increasing popularity in enterprise applications where AI agents and humans engage in objective-driven interactions. However, these systems are difficult to evaluate: data may be complex and unlabeled; human annotation is often impractical at scale; custom metrics can monitor for specific errors, but not previously undetected ones; and LLM judges can produce unreliable results. We introduce the first set of unsupervised metrics for objective-driven interactions, leveraging statistical properties of unlabeled interaction data and using fine-tuned LLMs to adapt to distributional shifts. We develop metrics for labeling user goals, measuring goal completion, and quantifying LLM uncertainty without grounding evaluations in human-generated ideal responses. Our approach is validated on open-domain and task-specific interaction data.