🤖 AI Summary
This work addresses the limitations of using a single large language model as an automatic evaluator, which is prone to inconsistent judgments and biases inherited from pre-training. To mitigate these issues, the authors propose CollabEval, a novel multi-agent evaluation framework centered on collaboration rather than adversarial interaction. The framework employs a three-stage process—initial assessment, iterative multi-round discussion, and final adjudication—augmented with a strategic consensus-checking mechanism to enhance both reliability and fairness while maintaining computational efficiency. Experimental results demonstrate that CollabEval significantly outperforms single-model baselines across multiple evaluation dimensions and exhibits robust performance even when individual agent capabilities are limited.
📝 Abstract
Large Language Models (LLMs) have revolutionized AI-generated content evaluation, with the LLM-as-a-Judge paradigm becoming increasingly popular. However, current single-LLM evaluation approaches face significant challenges, including inconsistent judgments and inherent biases from pre-training data. To address these limitations, we propose CollabEval, a novel multi-agent evaluation framework that implements a three-phase Collaborative Evaluation process: initial evaluation, multi-round discussion, and final judgment. Unlike existing approaches that rely on competitive debate or single-model evaluation, CollabEval emphasizes collaboration among multiple agents with strategic consensus checking for efficiency. Our extensive experiments demonstrate that CollabEval consistently outperforms single-LLM approaches across multiple dimensions while maintaining robust performance even when individual models struggle. The framework provides comprehensive support for various evaluation criteria while ensuring efficiency through its collaborative design.
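The three-phase loop described above can be sketched in a few lines. The following is a minimal toy illustration, not the paper's actual implementation: the agent interface (`agent(item, own, peers)`), the spread-based consensus test, and the mean-based final judgment are all assumptions standing in for details the abstract leaves open.

```python
from statistics import mean

def collab_eval(agents, item, max_rounds=3, consensus_tol=0.5):
    """Toy sketch of a CollabEval-style evaluation (interface assumed)."""
    # Phase 1: initial evaluation — each agent scores independently.
    scores = [agent(item, own=None, peers=None) for agent in agents]

    rounds_used = 0
    # Phase 2: multi-round discussion, with a consensus check that
    # skips remaining rounds once scores agree within `consensus_tol`.
    for _ in range(max_rounds):
        if max(scores) - min(scores) <= consensus_tol:
            break
        scores = [agent(item, own=s, peers=scores)
                  for agent, s in zip(agents, scores)]
        rounds_used += 1

    # Phase 3: final judgment — a plain mean stands in for the
    # paper's adjudication step.
    return mean(scores), rounds_used

def make_agent(initial_score):
    """Toy agent: starts from its own prior, then moves halfway
    toward the group mean in each discussion round."""
    def agent(item, own, peers):
        if peers is None:
            return initial_score          # Phase 1: independent score
        return (own + mean(peers)) / 2    # Phase 2: revise toward peers
    return agent

# Disagreeing agents trigger discussion; agreeing ones short-circuit it.
agents = [make_agent(s) for s in (2.0, 4.0, 9.0)]
score, rounds = collab_eval(agents, item="candidate response")
```

With real LLM agents, each call would carry the item plus the peers' scores and rationales in the prompt; the consensus check is what keeps the discussion phase cheap when agents already agree.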