🤖 AI Summary
Evaluating multi-turn goal-driven interactions is hard: annotations are scarce, human evaluation does not scale, existing custom metrics cannot catch previously unknown errors, and LLM-based self-assessment is unreliable.
Method: This paper proposes the first unsupervised, automated evaluation framework tailored to this setting. It extracts statistical features from interaction logs via large language models (LLMs), then jointly performs unsupervised goal clustering, distributional-shift adaptation, and uncertainty quantification to assess goal recognition accuracy, goal completion rate, and model confidence end to end, without requiring gold responses or human annotations.
Contribution/Results: The paper introduces the first unsupervised evaluation metric suite for goal-driven interactions and leverages fine-tuned LLMs to capture implicit failure patterns. Experiments on open-domain and task-oriented benchmarks demonstrate that the framework significantly improves assessment reliability and scalability, accurately detects previously unseen errors, and quantifies interaction quality with calibrated uncertainty estimates.
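The paper does not include code, so as a rough illustration of what unsupervised goal clustering over unlabeled interaction logs could look like, here is a minimal stdlib-only sketch: utterances are represented as bag-of-words vectors and greedily merged by cosine similarity. The function names, the threshold, and the bag-of-words representation are assumptions for illustration, not the paper's actual method (which uses LLM-derived features).

```python
import math
from collections import Counter

def bow(text):
    """Bag-of-words vector for an utterance (illustrative stand-in
    for the LLM-derived features the paper describes)."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def cluster_goals(utterances, threshold=0.3):
    """Greedy single-pass clustering: assign each utterance to the
    most similar existing cluster, or start a new one."""
    clusters = []  # list of (centroid_counter, member_list)
    for u in utterances:
        v = bow(u)
        best, best_sim = None, 0.0
        for c in clusters:
            s = cosine(v, c[0])
            if s > best_sim:
                best, best_sim = c, s
        if best is not None and best_sim >= threshold:
            best[0].update(v)   # fold the utterance into the centroid
            best[1].append(u)
        else:
            clusters.append((v, [u]))
    return [members for _, members in clusters]
```

For example, `cluster_goals(["cancel my order please", "please cancel the order", "track my package", "where is my package"])` groups the cancellation requests separately from the package-tracking ones.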
📝 Abstract
Large language models (LLMs) have seen increasing popularity in enterprise applications where AI agents and humans engage in objective-driven interactions. However, these systems are difficult to evaluate: data may be complex and unlabeled; human annotation is often impractical at scale; custom metrics can monitor for specific errors, but not previously undetected ones; and LLM judges can produce unreliable results. We introduce the first set of unsupervised metrics for objective-driven interactions, leveraging statistical properties of unlabeled interaction data and using fine-tuned LLMs to adapt to distributional shifts. We develop metrics for labeling user goals, measuring goal completion, and quantifying LLM uncertainty without grounding evaluations in human-generated ideal responses. Our approach is validated on open-domain and task-specific interaction data.