🤖 AI Summary
This paper addresses the susceptibility of LLM-based factual consistency evaluators to fluency bias and their difficulty in detecting factual errors. To this end, we propose a multi-agent debate framework that integrates a multi-agent system, stance-driven protocols, multi-round reasoning, and a consensus mechanism. Key contributions include: (1) the first introduction of a “stance initialization” mechanism, which requires agents to generate and exchange arguments defending randomly assigned initial stances, thereby increasing debate diversity; and (2) the definition of “ambiguity” as a novel evaluation dimension, operationalized via a fine-grained taxonomy that goes beyond the traditional binary faithful/unfaithful classification. Experiments demonstrate that our framework significantly improves error detection rates across multiple benchmarks, accurately identifies ambiguous cases, and outperforms existing state-of-the-art methods on non-ambiguous summaries.
📝 Abstract
Faithfulness evaluators based on large language models (LLMs) are often fooled by the fluency of the text and struggle to identify errors in summaries. We propose an approach to summary faithfulness evaluation in which multiple LLM-based agents are assigned initial stances (regardless of their actual beliefs) and forced to produce a reason justifying the imposed belief, thus engaging in a multi-round debate to reach an agreement. The uniformly distributed initial assignments result in greater stance diversity, leading to more meaningful debates and ultimately more errors identified. Furthermore, by analyzing recent faithfulness evaluation datasets, we observe that a summary is not always strictly faithful or unfaithful to its source document. We therefore introduce a new dimension, ambiguity, and a detailed taxonomy for identifying such special cases. Experiments demonstrate that our approach helps identify ambiguities and achieves even stronger performance on non-ambiguous summaries.
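The debate protocol described above (uniform stance initialization, multi-round argument exchange, consensus or fallback vote) can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: `Agent.argue` and `Agent.update` are hypothetical stand-ins for the LLM calls that would generate and revise stance-conditioned arguments.

```python
import random
from dataclasses import dataclass, field

STANCES = ("faithful", "unfaithful")

@dataclass
class Agent:
    name: str
    stance: str  # imposed initial stance, independent of the model's own belief
    history: list = field(default_factory=list)

    def argue(self, transcript):
        # Stand-in for an LLM call: the real system would prompt the model to
        # justify `self.stance` given the document, the summary, and the other
        # agents' arguments so far.
        return f"{self.name} argues the summary is {self.stance}."

    def update(self, transcript):
        # Stand-in for stance revision: a real agent may switch sides after
        # reading opposing arguments; this sketch keeps the stance fixed.
        return self.stance

def debate(n_agents=4, max_rounds=3, seed=0):
    rng = random.Random(seed)
    # Stance initialization: stances are drawn uniformly at random,
    # so opposing views are represented from the start.
    agents = [Agent(f"agent{i}", rng.choice(STANCES)) for i in range(n_agents)]
    transcript = []
    for _ in range(max_rounds):
        for a in agents:
            transcript.append(a.argue(transcript))
        stances = [a.update(transcript) for a in agents]
        if len(set(stances)) == 1:  # consensus reached
            return stances[0], transcript
    # No consensus within max_rounds: fall back to a majority vote.
    verdict = max(set(stances), key=stances.count)
    return verdict, transcript
```

In a full system, `update` would let agents abandon their imposed stance once the opposing arguments become convincing, which is what drives convergence toward the correct verdict.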