MAD-Fact: A Multi-Agent Debate Framework for Long-Form Factuality Evaluation in LLMs

📅 2025-10-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
Evaluating the factual accuracy of long-form text generated by large language models (LLMs) remains challenging in high-stakes domains (e.g., biomedicine, law), where hallucinations can have severe consequences. Method: the paper contributes (1) LongHalluQA, the first Chinese benchmark for long-form factuality evaluation; (2) a hierarchical, claim-importance-aware fine-grained assessment scheme; and (3) MAD-Fact, an integrated pipeline combining long-text decomposition, collaborative-competitive multi-agent verification, self-consistency analysis, and importance-weighted scoring. Results: experiments demonstrate substantial gains in the systematicity and reliability of long-form factuality evaluation; notably, domestic Chinese LLMs exhibit higher factual consistency than leading international models on Chinese long-form tasks. The core contribution lies in integrating adversarial debate mechanisms into long-form fact verification and enabling dynamic, importance-aware weighted assessment, advancing both methodological rigor and practical applicability in high-fidelity LLM evaluation.
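The pipeline's final step, importance-weighted scoring over agent verdicts, can be sketched as follows. This is a minimal illustration, not the paper's actual implementation: the claim texts, the `SUPPORTED`/`REFUTED` labels, the majority-vote self-consistency rule, and the weighting formula are all assumptions for the sake of the example.

```python
from collections import Counter
from dataclasses import dataclass

@dataclass
class Claim:
    text: str
    importance: float  # weight assigned by the fact-importance hierarchy
    verdicts: list     # one label per debating agent / debate round

def majority_verdict(verdicts):
    """Self-consistency step (assumed): keep the label most agents agree on."""
    label, _ = Counter(verdicts).most_common(1)[0]
    return label

def weighted_factuality_score(claims):
    """Importance-weighted fraction of claims judged SUPPORTED."""
    total = sum(c.importance for c in claims)
    supported = sum(c.importance for c in claims
                    if majority_verdict(c.verdicts) == "SUPPORTED")
    return supported / total if total else 0.0

# Hypothetical decomposed claims with per-agent verdicts from the debate.
claims = [
    Claim("Aspirin inhibits COX enzymes.", importance=1.0,
          verdicts=["SUPPORTED", "SUPPORTED", "REFUTED"]),
    Claim("It was first marketed in 1899.", importance=0.3,
          verdicts=["SUPPORTED", "SUPPORTED", "SUPPORTED"]),
    Claim("It cures viral infections.", importance=1.0,
          verdicts=["REFUTED", "REFUTED", "SUPPORTED"]),
]
print(round(weighted_factuality_score(claims), 3))  # 1.3 / 2.3 ≈ 0.565
```

Under this toy weighting, a refuted high-importance claim drags the score down far more than a refuted minor one, which is the intuition behind the paper's fact-importance hierarchy.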

📝 Abstract
The widespread adoption of Large Language Models (LLMs) raises critical concerns about the factual accuracy of their outputs, especially in high-risk domains such as biomedicine, law, and education. Existing evaluation methods for short texts often fail on long-form content due to complex reasoning chains, intertwined perspectives, and cumulative information. To address this, we propose a systematic approach integrating large-scale long-form datasets, multi-agent verification mechanisms, and weighted evaluation metrics. We construct LongHalluQA, a Chinese long-form factuality dataset, and develop MAD-Fact, a debate-based multi-agent verification system. We introduce a fact importance hierarchy to capture the varying significance of claims in long-form texts. Experiments on two benchmarks show that larger LLMs generally maintain higher factual consistency, while domestic models excel on Chinese content. Our work provides a structured framework for evaluating and enhancing factual reliability in long-form LLM outputs, guiding their safe deployment in sensitive domains.
Problem

Research questions and friction points this paper is trying to address.

Evaluating factual accuracy in long-form LLM outputs across domains
Addressing limitations of existing methods on complex reasoning chains
Developing multi-agent verification for cumulative information in texts
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-agent debate framework for factuality evaluation
Weighted metrics with fact importance hierarchy
Large-scale long-form dataset integration
👥 Authors
Yucheng Ning
Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China
Xixun Lin
Institute of Information Engineering, Chinese Academy of Sciences
Topics: Data mining, Graph representation learning, Large language models
Fang Fang
Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China
Yanan Cao
Institute of Information Engineering, Chinese Academy of Sciences