🤖 AI Summary
To address the limited accuracy of large language models (LLMs) as evaluators in multi-turn dialogues, this paper proposes the first interpretable evaluation framework integrating speech act theory and Gricean conversational maxims. Methodologically, it models intent dynamics and principle adherence via multi-granularity context encoding and a heterogeneous LLM jury ensemble. Contributions include: (1) the first systematic incorporation of speech act recognition and quantitative modeling of the four Gricean maxims into evaluator design; (2) statistically significant improvements over state-of-the-art baselines across four challenging multi-turn dialogue datasets; (3) 75% of preference judgments attributable to speech act or maxim-based features, confirming explanatory power; and (4) the empirical finding that user intent evolves across turns in 60–70% of dialogues, a key phenomenon enabling cognitively grounded, interpretable LLM evaluation.
📝 Abstract
Today, large language models are widely used as judges to evaluate responses from other language models. Hence, it is imperative to benchmark and improve these LLM-judges on real-world language model usage: a typical human-assistant conversation is lengthy, and shows significant diversity in topics, intents, and requirements across turns, e.g., social interactions, task requests, feedback. We present Amulet, a framework that leverages the pertinent linguistic concepts of dialog acts and maxims to improve the accuracy of LLM-judges on preference data with complex, multi-turn conversational context. Amulet extracts valuable insights about (a) the communicative structures and intents present in the conversation (dialog acts), and (b) the satisfaction of conversational principles (maxims) by the preference responses, and uses them to make judgments. On four challenging datasets, Amulet shows that (a) humans frequently (60 to 70 percent of the time) change their intents from one turn of the conversation to the next, and (b) in 75 percent of instances, the preference responses can be differentiated via dialog acts and/or maxims, underscoring the significance of these features for judging such data. Amulet can be used either as a judge, by applying the framework to a single LLM, or integrated into a jury of different LLM judges; our judges and juries show strong improvements over relevant baselines on all four datasets.
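To make the jury idea concrete, here is a minimal sketch of how per-maxim judgments from several LLM judges could be aggregated into a single preference verdict. All names, the 0/1 maxim-satisfaction scoring, and the majority-vote aggregation are illustrative assumptions, not the paper's actual method or API.

```python
from collections import Counter

# Hypothetical illustration of an Amulet-style jury: each judge scores two
# candidate responses (A and B) on the four Gricean maxims, prefers the one
# satisfying more maxims, and the jury aggregates preferences by majority vote.
MAXIMS = ["quantity", "quality", "relation", "manner"]

def judge_preference(scores_a: dict, scores_b: dict) -> str:
    """One judge's verdict: prefer the response satisfying more maxims."""
    a = sum(scores_a[m] for m in MAXIMS)
    b = sum(scores_b[m] for m in MAXIMS)
    return "A" if a > b else "B" if b > a else "tie"

def jury_verdict(per_judge_scores: list) -> str:
    """Majority vote over the preferences of heterogeneous judges."""
    votes = Counter(judge_preference(sa, sb) for sa, sb in per_judge_scores)
    return votes.most_common(1)[0][0]

# Example: three judges, each giving 0/1 maxim-satisfaction scores per response.
judges = [
    ({"quantity": 1, "quality": 1, "relation": 1, "manner": 0},   # judge 1 -> A
     {"quantity": 1, "quality": 0, "relation": 1, "manner": 0}),
    ({"quantity": 0, "quality": 1, "relation": 1, "manner": 1},   # judge 2 -> A
     {"quantity": 1, "quality": 0, "relation": 1, "manner": 0}),
    ({"quantity": 1, "quality": 0, "relation": 0, "manner": 0},   # judge 3 -> B
     {"quantity": 1, "quality": 1, "relation": 1, "manner": 0}),
]
print(jury_verdict(judges))  # A wins two of three votes
```

In practice the per-maxim scores would come from prompting each LLM judge, and the aggregation could weight judges or maxims rather than use a flat majority vote; this sketch only shows the aggregation shape.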