RefuteBench 2.0 -- Agentic Benchmark for Dynamic Evaluation of LLM Responses to Refutation Instruction

πŸ“… 2025-02-25
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
Existing evaluations of large language models (LLMs) lack systematic assessment of response quality, memory consistency, and long-term robustness under multi-turn user refutations. Method: We introduce the first dynamic benchmark dedicated to multi-turn refutation evaluation, proposing an agent-driven paradigm in which LLM agents serve as both refutation generators and evaluators. The framework supports both transient and persistent refutation instructions and strengthens evaluation reliability via attention analysis and meta-evaluation alignment. Results: Experiments reveal that while mainstream models respond promptly to refutations, they suffer from long-range information forgetting and progressive performance degradation. The generated refutations exhibit greater human-likeness, and automated evaluation scores correlate strongly with human annotations (Spearman’s ρ > 0.85). This work establishes a novel benchmark and interpretable analytical toolkit for advancing the reliability of interactive LLMs.
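The paper's pipeline is not reproduced here, but the agent-driven loop the summary describes (a refuter agent generating user feedback, the model under test responding, and an evaluator agent scoring each turn) can be sketched roughly as follows. This is a minimal illustration; the function names, message format, and toy stubs are assumptions chosen for readability, not the benchmark's actual API (see the repository linked in the abstract for that).

```python
from dataclasses import dataclass, field

@dataclass
class DialogueState:
    """Running multi-turn conversation plus per-turn evaluator scores."""
    messages: list = field(default_factory=list)
    scores: list = field(default_factory=list)

def run_refutation_dialogue(task_prompt, target_model, refuter, evaluator, n_turns=3):
    """Sketch of an agent-driven refutation loop (hypothetical interface).

    target_model(messages) -> assistant reply (the LLM under test)
    refuter(messages)      -> a refutation instruction reacting to the last reply
    evaluator(messages)    -> score for how well the refutation (and the
                              original task) is satisfied
    """
    state = DialogueState(messages=[{"role": "user", "content": task_prompt}])
    for _ in range(n_turns):
        reply = target_model(state.messages)
        state.messages.append({"role": "assistant", "content": reply})
        refutation = refuter(state.messages)            # agent generates user feedback
        state.messages.append({"role": "user", "content": refutation})
        state.scores.append(evaluator(state.messages))  # agent judges compliance
    return state

# Toy stubs so the sketch runs end to end (stand-ins for real LLM calls).
if __name__ == "__main__":
    target = lambda msgs: "Draft answer v%d" % (len(msgs) // 2 + 1)
    refute = lambda msgs: "Please revise: use a formal tone from now on."
    judge = lambda msgs: 1.0 if "formal" in msgs[-1]["content"] else 0.0
    print(run_refutation_dialogue("Summarize the paper.", target, refute, judge).scores)
```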

πŸ“ Abstract
In the multi-turn interaction schema, large language models (LLMs) can leverage user feedback to enhance the quality and relevance of their responses. However, evaluating an LLM's ability to incorporate user refutation feedback is crucial yet challenging. In this study, we introduce RefuteBench 2.0, which significantly extends the original RefuteBench by incorporating LLM agents as refuters and evaluators, which allows for flexible and comprehensive assessment. We design both transient and persistent refutation instructions with different validity periods. Meta-evaluation shows that the LLM-based refuter could generate more human-like refutations and the evaluators could assign scores with high correlation with humans. Experimental results of various LLMs show that current models could effectively satisfy the refutation but fail to memorize the refutation information. Interestingly, we also observe that the performance of the initial task decreases as the refutations increase. Analysis of the attention scores further shows a potential weakness of current LLMs: they struggle to retain and correctly use previous information during long context dialogues. https://github.com/ElliottYan/RefuteBench-2.0
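To make the abstract's distinction between transient and persistent refutation instructions (different validity periods) concrete, here is a small illustrative sketch. The class, the field names, and the one-turn versus whole-dialogue validity rule are assumptions made for clarity, not the paper's exact formalization.

```python
from dataclasses import dataclass

@dataclass
class RefutationInstruction:
    """A user refutation with an explicit validity period (illustrative)."""
    text: str
    persistent: bool  # True: applies to every later turn; False: next turn only

def active_instructions(history, current_turn):
    """Return the refutations the model is still expected to honor at `current_turn`.

    `history` is a list of (turn_index, RefutationInstruction) pairs.
    Persistent instructions stay active for the rest of the dialogue;
    transient ones only constrain the turn that immediately follows them.
    """
    active = []
    for issued_at, instr in history:
        if instr.persistent and issued_at < current_turn:
            active.append(instr)
        elif not instr.persistent and issued_at == current_turn - 1:
            active.append(instr)
    return active

# Example: a persistent style constraint plus a transient content fix.
history = [
    (1, RefutationInstruction("Always answer in bullet points.", persistent=True)),
    (2, RefutationInstruction("Correct the date in your last answer.", persistent=False)),
]
print([i.text for i in active_instructions(history, current_turn=3)])  # both active
print([i.text for i in active_instructions(history, current_turn=4)])  # only the persistent one
```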
Problem

Research questions and friction points this paper is trying to address.

Evaluate LLM response to refutation
Assess LLM memory of refutation
Analyze LLM performance in dialogues
Innovation

Methods, ideas, or system contributions that make the work stand out.

LLM agents as refuters
Transient and persistent instructions
Attention score analysis
Jianhao Yan
Zhejiang University & Westlake University, Hangzhou, China
Yun Luo
Shanghai AI Lab
natural language processing, graph neural network
Yue Zhang
School of Engineering, Westlake University, Hangzhou, China