🤖 AI Summary
This work addresses the limitations of traditional majority voting in multi-agent large language model systems: it overlooks the evidential structure of reasoning paths and is susceptible to shared hallucinations. To overcome these issues, the authors propose AgentAuditor, a framework that explicitly models inter-agent consensus and disagreement by constructing reasoning trees and replaces global voting with a local conflict-verification mechanism. They further introduce an Anti-Consensus Preference Optimization (ACPO) strategy that leverages minority correct reasoning paths to rectify collective errors. Evaluated across five mainstream multi-agent setups, the method achieves up to a 5% accuracy improvement over majority voting and up to a 3% improvement over LLM-as-Judge, effectively transcending the constraints of existing aggregation mechanisms.
📝 Abstract
Multi-agent systems (MAS) can substantially extend the reasoning capacity of large language models (LLMs), yet most frameworks still aggregate agent outputs with majority voting. This heuristic discards the evidential structure of reasoning traces and is brittle under confabulated consensus, where agents share correlated biases and converge on the same incorrect rationale. We introduce AgentAuditor, which replaces voting with a path search over a Reasoning Tree that explicitly represents agreements and divergences among agent traces. AgentAuditor resolves conflicts by comparing reasoning branches at critical divergence points, turning global adjudication into efficient, localized verification. We further propose Anti-Consensus Preference Optimization (ACPO), which trains the adjudicator on majority-failure cases and rewards evidence-based minority selections over popular errors. AgentAuditor is agnostic to the MAS setting; across five popular settings it yields up to a 5% absolute accuracy improvement over majority voting and up to 3% over using LLM-as-Judge.
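The localized verification idea in the abstract can be sketched in a few lines. This is an illustrative toy, not the paper's implementation: the trace representation (a list of step strings per agent) and the `score_branch` function (a stand-in for the trained adjudicator scoring a divergent step) are assumptions made for the example.

```python
# Toy sketch of divergence-point adjudication over agent reasoning traces.
# `score_branch` is a hypothetical stand-in for the adjudicator model;
# the real Reasoning Tree construction is not reproduced here.
from collections import defaultdict


def first_divergence(traces):
    """Return the index of the first step where the traces disagree,
    or None if they agree along their shared prefix."""
    for i in range(min(len(t) for t in traces)):
        if len({t[i] for t in traces}) > 1:
            return i
    return None


def adjudicate(traces, score_branch):
    """Resolve conflicts locally: instead of voting over whole traces,
    compare only the competing steps at each divergence point and
    follow the best-scored branch; the last step is the final answer."""
    i = first_divergence(traces)
    if i is None:
        return traces[0][-1]  # full consensus: any trace's answer
    branches = defaultdict(list)  # divergent step -> traces taking it
    for t in traces:
        branches[t[i]].append(t)
    best_step = max(branches, key=score_branch)
    # Recurse within the winning branch to resolve later divergences.
    return adjudicate(branches[best_step], score_branch)


# Usage: two agents share a flawed step "x"; one dissenting agent uses "y".
# A scorer that favors the better-evidenced step overrides the majority.
traces = [["read", "x", "42"], ["read", "x", "42"], ["read", "y", "7"]]
score = lambda step: {"x": 0.2, "y": 0.9}.get(step, 0.0)
print(adjudicate(traces, score))  # → 7 (the minority answer wins)
```

Note how the adjudication touches only the divergent steps ("x" vs "y"), never re-scoring the agreed-upon prefix, which is what makes the verification localized rather than global.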