🤖 AI Summary
This work challenges conventional multi-agent systems that rely on majority voting or hierarchical aggregation, which treat consensus as the end goal and discard much of the agents' reasoning. Instead, the authors propose aggregating complete reasoning traces as the fundamental unit, generating diverse traces through semantic-preserving input perturbations. Their approach, Self-Consistent Mixture of Agents, protects the majority answer through anchored refinement with provable non-degradation guarantees and always performs trace-level synthesis rather than gating on consensus. Notably, it reveals an “aggregation paradox”: even when all agents converge on an incorrect answer, the correct solution can still be recovered from their collective reasoning traces, with beneficial corrections consistently outweighing harmful ones. Experiments show that a single model with perturbation-induced trace variation outperforms pools of heterogeneous models across structured reasoning, PhD-level science, competition mathematics, and competitive programming.
📝 Abstract
When multiple LLM agents solve the same problem, standard practice compresses each agent's reasoning into a majority vote or layered synthesis, treating agreement as the finish line. We show this is unnecessarily lossy: an LLM aggregator that reads complete reasoning traces recovers correct solutions even when agents unanimously agree, with beneficial corrections consistently outweighing harmful ones -- the “aggregation paradox”. Majority voting has a ceiling that perturbation diversity does not raise (error correlations are identical); the aggregator's gain comes from trace-level complementarity, assembling correct intermediate steps from minority chains that voting discards. These findings motivate Self-Consistent Mixture of Agents, which generates trace diversity through semantic-preserving input perturbations, safeguards the majority via anchored refinement with provable non-degradation guarantees, and always synthesizes -- never gates on consensus. A single model with perturbation-induced trace variation outperforms heterogeneous model pools across structured reasoning, PhD-level science, competition mathematics, and competitive programming. The unit of aggregation should be the reasoning trace, not the answer.
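To make the contrast between answer-level and trace-level aggregation concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the authors' implementation: the callables `perturb`, `solve`, and `synthesize` are hypothetical placeholders for a perturbation function, a single LLM call, and an LLM aggregator, and the paper's actual prompts and anchored-refinement procedure are not reproduced here. The sketch shows only the control flow the abstract describes: the majority answer is computed as an anchor, and the synthesizer is always called with the full traces, even when every agent agrees.

```python
from collections import Counter
from typing import Callable, List, Tuple

# A trace pairs the full reasoning text with the final answer it reached.
Trace = Tuple[str, str]


def majority_vote(traces: List[Trace]) -> str:
    """Answer-level baseline: discard the reasoning, return the most common answer."""
    counts = Counter(answer for _, answer in traces)
    return counts.most_common(1)[0][0]


def aggregate_traces(
    question: str,
    perturb: Callable[[str, int], str],                  # hypothetical: meaning-preserving rewrite
    solve: Callable[[str], Trace],                       # hypothetical: one LLM call returning a trace
    synthesize: Callable[[str, List[Trace], str], str],  # hypothetical: LLM aggregator
    n_agents: int = 4,
) -> str:
    """Trace-level aggregation: always synthesize, never gate on consensus."""
    # The same model sees n_agents perturbed variants of the question.
    prompts = [perturb(question, i) for i in range(n_agents)]
    traces = [solve(p) for p in prompts]
    # The majority answer is passed along as an anchor for the aggregator.
    anchor = majority_vote(traces)
    # The aggregator reads every complete trace, even under unanimous agreement.
    return synthesize(question, traces, anchor)


# Toy stubs (illustrative only, no real model is called).
calls = []

def toy_perturb(q: str, i: int) -> str:
    return f"[variant {i}] {q}"

def toy_solve(prompt: str) -> Trace:
    return (f"reasoning for {prompt}", "42")

def toy_synthesize(q: str, traces: List[Trace], anchor: str) -> str:
    calls.append((len(traces), anchor))
    return anchor  # a real aggregator could revise the anchor after reading the traces

result = aggregate_traces("What is 6 * 7?", toy_perturb, toy_solve, toy_synthesize)
```

In the toy run all four stub agents agree, yet `toy_synthesize` is still invoked once with all four traces. That is the "always synthesizes, never gates on consensus" behavior in its simplest form.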