Evaluating the Performance of Large Language Models via Debates

📅 2024-06-16

🏛️ arXiv.org

📈 Citations: 2

✨ Influential: 0

🤖 AI Summary

Existing LLM evaluation methods rely heavily on human annotation, are tied to specific domains, and scale poorly. To address these limitations, this paper proposes the first automated evaluation framework based on multi-agent debate: a proposer LLM poses questions, multiple debater LLMs engage in multi-turn argumentation, and a judge LLM holistically assesses knowledge mastery, reasoning capability, and contradiction detection. The framework integrates structured debate prompt engineering, LLM self-judgment mechanisms, and consistency/logicality scoring models. Evaluated on multiple SOTA models, it achieves high alignment with human judgments (Spearman ρ > 0.92), substantially reduces evaluation cost, and demonstrates strong scalability and multidimensional dynamic assessment capability. Its core contribution is to introduce the debate paradigm systematically into LLM evaluation, which removes the dependence on manual annotation and establishes a reproducible, generalizable automated evaluation paradigm.
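
The debater and judge roles described above can be sketched as a small loop. This is a minimal illustration, not the paper's implementation: the `LLM` callable type, the `run_debate` function, the prompt wording, and the rule that the judge replies with a debater's name are all assumptions made for the example, and question generation by a proposer model is left out (the topic is passed in directly).

```python
from typing import Callable, Dict, List

# Hypothetical type: any function that maps a prompt string to a reply string.
LLM = Callable[[str], str]


def run_debate(topic: str, debaters: Dict[str, LLM], judge: LLM, rounds: int = 2) -> str:
    """Run a multi-turn debate and return the name of the debater the judge picks."""
    transcript: List[str] = []
    for _ in range(rounds):
        # Each debater speaks once per round and sees everything said so far.
        for name, model in debaters.items():
            history = "\n".join(transcript)
            prompt = f"Topic: {topic}\nDebate so far:\n{history}\nYour argument, {name}:"
            transcript.append(f"{name}: {model(prompt)}")

    verdict_prompt = (
        f"Topic: {topic}\n"
        + "\n".join(transcript)
        + "\nName the debater who argued best: "
        + ", ".join(debaters)
    )
    verdict = judge(verdict_prompt)

    # Assumption: the judge's reply contains the name of exactly one debater.
    for name in debaters:
        if name in verdict:
            return name
    raise ValueError(f"Judge verdict did not name a debater: {verdict!r}")


if __name__ == "__main__":
    # Stub models stand in for real LLM calls so the sketch runs offline.
    stub_debaters = {
        "alpha": lambda prompt: "Claim with supporting evidence.",
        "beta": lambda prompt: "Counterclaim.",
    }
    stub_judge = lambda prompt: "alpha"
    print(run_debate("Is caching always beneficial?", stub_debaters, stub_judge))
```

Passing the models in as plain callables keeps the loop independent of any particular LLM client; a real setup would wrap API calls behind the same signature.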

📝 Abstract

Large Language Models (LLMs) are rapidly evolving and impacting various fields, necessitating the development of effective methods to evaluate and compare their performance. Most current approaches for performance evaluation are either based on fixed, domain-specific questions that lack the flexibility required in many real-world applications, or rely on human input, making them unscalable. To address these issues, we propose an automated benchmarking framework based on debates between LLMs, judged by another LLM. This method assesses not only domain knowledge, but also skills such as argumentative reasoning and inconsistency recognition. We evaluate the performance of various state-of-the-art LLMs using the debate framework and achieve rankings that align closely with popular rankings based on human input, eliminating the need for costly human crowdsourcing.

Problem

Research questions and friction points this paper is trying to address.

Evaluate LLM performance via automated debates

Assess argumentative reasoning and inconsistency skills

Align rankings with human input without crowdsourcing
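
One way to quantify how closely a debate-derived ranking aligns with a human-based ranking is Spearman rank correlation. The sketch below assumes both rankings cover the same models with no ties and uses the closed-form formula for that case; the function name `spearman_rho` and the placeholder model names are hypothetical, and no values here come from the paper.

```python
from typing import List


def spearman_rho(ranking_a: List[str], ranking_b: List[str]) -> float:
    """Spearman rank correlation between two tie-free rankings of the same items."""
    if len(ranking_a) != len(ranking_b) or set(ranking_a) != set(ranking_b):
        raise ValueError("Rankings must contain the same items")
    if len(set(ranking_a)) != len(ranking_a):
        raise ValueError("Rankings must not contain duplicates")
    n = len(ranking_a)
    if n < 2:
        raise ValueError("Need at least two items")

    # Position of each item in the second ranking (0 = best).
    position_b = {item: rank for rank, item in enumerate(ranking_b)}
    # Sum of squared rank differences across all items.
    d_squared = sum((rank - position_b[item]) ** 2 for rank, item in enumerate(ranking_a))
    # Closed form, valid only when neither ranking has ties.
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))


if __name__ == "__main__":
    # Placeholder names for illustration only; these are not results.
    debate_ranking = ["model_a", "model_b", "model_c", "model_d"]
    human_ranking = ["model_a", "model_c", "model_b", "model_d"]
    print(spearman_rho(debate_ranking, human_ranking))
```

A value near 1 means the two orderings largely agree; a value near -1 means one is close to the reverse of the other.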

Innovation

Methods, ideas, or system contributions that make the work stand out.

LLM-based automated benchmarking framework

Debate method evaluates multiple skills

Reduces need for human crowdsourcing

🔎 Similar Papers

Adversarial Multi-Agent Evaluation of Large Language Models through Iterative Debates