Towards Scalable Oversight with Collaborative Multi-Agent Debate in Error Detection

📅 2025-10-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large language models (LLMs) require scalable supervision for error detection in complex tasks, yet self-diagnosis is unreliable, and existing competitive multi-agent debate (MAD) approaches often induce misleading strategies—yielding performance worse than single-agent baselines. Method: We propose ColMAD, a collaborative, non-zero-sum debate protocol that restructures agent interaction around complementary critique, shared information exchange, and evidence-driven adjudication. It integrates a collaborative critique-feedback architecture with evidence-aggregated collective decision-making. Contribution/Results: ColMAD significantly improves detection of subtle errors by fostering constructive alignment among agents. Empirical evaluation shows a 19% absolute improvement over competitive MAD on error detection tasks and consistent superiority over single-agent baselines. These results validate the efficacy and robustness of cooperative paradigms for scalable LLM supervision.

📝 Abstract
Accurate detection of errors in large language model (LLM) responses is central to the success of scalable oversight, or providing effective supervision to superhuman intelligence. Yet self-diagnosis is often unreliable on complex tasks unless aided by reliable external feedback. Multi-agent debate (MAD) seems to be a natural alternative to external feedback: multiple LLMs provide complementary perspectives and cross-checks for error detection. However, prior MAD protocols frame debate as a zero-sum game, where the debaters compete to win rather than seek the truth. This leads to debate hacking: debaters tend to mislead the judge by misinterpreting the task or presenting overconfident claims, which introduces more mistakes and underperforms single-agent methods. To mitigate this issue, we introduce a new collaborative MAD protocol, termed ColMAD, that reframes MAD as a non-zero-sum game. Specifically, ColMAD encourages multiple agents to criticize each other in a supportive way, so that they complement each other's missing points. The judge agent can then reach a more informed conclusion based on more comprehensive evidence. Empirically, we show that ColMAD significantly outperforms previous competitive MAD by 19% and brings non-trivial improvements over single-agent methods in error detection.
Problem

Research questions and friction points this paper is trying to address.

Detecting errors in LLM responses to enable scalable oversight
Mitigating debate hacking in multi-agent systems through collaboration
Improving error detection accuracy via supportive agent criticism
Innovation

Methods, ideas, or system contributions that make the work stand out.

Reframes multi-agent debate as a collaborative, non-zero-sum game
Agents critique each other supportively to complement missing points
Judge reaches an informed conclusion from the debate's comprehensive evidence
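The protocol described above can be sketched in a few lines. This is a minimal, hypothetical illustration, not the paper's implementation: it assumes debaters and the judge are plain text-to-text callables (e.g. wrappers around an LLM API), that critiques are pooled into a shared evidence log each round, and that the judge adjudicates over all accumulated evidence. Prompt wording, round count, and function names are all assumptions.

```python
from typing import Callable, List

def colmad_debate(task: str, response: str,
                  debaters: List[Callable[[str], str]],
                  judge: Callable[[str], str],
                  rounds: int = 2) -> str:
    """Hypothetical sketch of a collaborative (non-zero-sum) MAD round.

    Each debater critiques the response while seeing peers' prior
    critiques, and is prompted to extend (not dispute) them; a judge
    then aggregates all evidence into a verdict.
    """
    evidence: List[str] = []
    for _ in range(rounds):
        shared = "\n".join(evidence)  # information exchanged across agents
        for critic in debaters:
            prompt = (
                f"Task: {task}\nResponse: {response}\n"
                f"Evidence so far:\n{shared}\n"
                "Point out errors the evidence above has missed, and "
                "supportively refine (rather than attack) earlier critiques."
            )
            evidence.append(critic(prompt))
    verdict_prompt = (
        f"Task: {task}\nResponse: {response}\n"
        "All evidence:\n" + "\n".join(evidence) +
        "\nDecide whether the response contains an error, citing evidence."
    )
    return judge(verdict_prompt)
```

With stubbed agents in place of real LLM calls, `colmad_debate("2+2=?", "5", [a, b], judge, rounds=1)` returns whatever verdict the judge produces from the pooled critiques; the key design choice versus competitive MAD is that every critique lands in one shared evidence pool instead of opposing transcripts.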
🔎 Similar Papers
No similar papers found.