🤖 AI Summary
Problem: Detecting errors in large language model (LLM) responses is central to scalable oversight, yet self-diagnosis is unreliable on complex tasks, and existing competitive multi-agent debate (MAD) approaches often induce misleading strategies that perform worse than single-agent baselines.
Method: We propose ColMAD, a collaborative MAD protocol that reframes debate as a non-zero-sum game. Agents critique each other supportively so that each fills in points the others missed, and a judge agent reaches its conclusion from the combined, more comprehensive evidence.
Contribution/Results: ColMAD improves error detection by having agents complement one another rather than compete. Empirically, it outperforms previous competitive MAD by 19% and brings non-trivial improvements over single-agent methods on error detection, supporting collaborative debate as an approach to scalable LLM oversight.
📝 Abstract
Accurate detection of errors in large language model (LLM) responses is central to the success of scalable oversight, that is, providing effective supervision to superhuman intelligence. Yet self-diagnosis is often unreliable on complex tasks unless aided by reliable external feedback. Multi-agent debate (MAD) seems to be a natural alternative to external feedback: multiple LLMs provide complementary perspectives and cross-checks for error detection. However, prior MAD protocols frame debate as a zero-sum game, where the debaters compete to win the game instead of seeking the truth. This leads to debate hacking: debaters tend to mislead the judge by misinterpreting the task or presenting overconfident claims, which introduces more mistakes and makes MAD underperform single-agent methods. To mitigate this issue, we introduce a new collaborative MAD protocol, termed ColMAD, that reframes MAD as a non-zero-sum game. Specifically, ColMAD encourages multiple agents to criticize each other in a supportive way, so that each can fill in the points the others missed. The judge agent can therefore reach a better-informed conclusion based on more comprehensive evidence. Empirically, we show that ColMAD significantly outperforms previous competitive MAD by 19% and brings non-trivial improvements over single-agent methods in error detection.
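The abstract does not specify prompts, round counts, or the aggregation rule, so the following is only a minimal sketch of the general shape of a collaborative debate loop, not the authors' implementation. Every name here (`collaborative_debate`, `arithmetic_checker`, `supportive_reviewer`, `evidence_judge`) is hypothetical, and the toy rule-based agents stand in for LLM calls. The one idea it illustrates is that each agent sees the shared transcript and builds on earlier critiques, and the judge decides from the pooled evidence.

```python
from typing import Callable, List, Tuple

# Hypothetical types: an agent maps (response, transcript so far) to a critique;
# a judge maps (response, full transcript) to "error detected?".
Agent = Callable[[str, List[str]], str]
Judge = Callable[[str, List[str]], bool]


def collaborative_debate(
    response: str, agents: List[Agent], judge: Judge, rounds: int = 1
) -> Tuple[bool, List[str]]:
    """Run a shared-transcript critique loop, then let the judge decide."""
    transcript: List[str] = []
    for _ in range(rounds):
        for agent in agents:
            # Each agent reads all earlier critiques, so it can complement
            # them instead of arguing for a fixed opposing side.
            transcript.append(agent(response, list(transcript)))
    return judge(response, transcript), transcript


# Toy stand-ins for LLM agents (assumptions for illustration only).
def arithmetic_checker(response: str, transcript: List[str]) -> str:
    if "2 + 2 = 5" in response:
        return "ERROR: 2 + 2 equals 4, not 5."
    return "OK: no arithmetic issue found."


def supportive_reviewer(response: str, transcript: List[str]) -> str:
    prior_errors = [c for c in transcript if c.startswith("ERROR")]
    if prior_errors:
        return "ERROR: confirming earlier critique; " + prior_errors[-1][len("ERROR: "):]
    return "OK: nothing to add to earlier critiques."


def evidence_judge(response: str, transcript: List[str]) -> bool:
    # Flags the response if any pooled critique reports an error.
    return any(c.startswith("ERROR") for c in transcript)


if __name__ == "__main__":
    flagged, log = collaborative_debate(
        "2 + 2 = 5", [arithmetic_checker, supportive_reviewer], evidence_judge
    )
    print(flagged)
    for line in log:
        print(line)
```

In a real system the agents and judge would be LLM calls, and the judge would weigh the evidence rather than apply a simple "any error" rule; the sketch only shows the non-adversarial information flow.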