AI Summary
This work addresses the limitations of large language models (LLMs) on complex reasoning tasks, where reliance on a single generate-and-select pipeline constrains performance and existing ensemble methods lack theoretical guarantees. The authors propose a multi-agent reasoning framework grounded in aligned delegation games: a principal delegates the task to multiple agents through incentive mechanisms so that they generate candidate solutions, and then selects the final answer, thereby enabling structured interaction and objective alignment. The approach provides the first theoretical guarantee of performance improvement for multi-agent LLM reasoning, relaxing the independence assumption and explicitly accounting for correlations among candidate solutions, thus overcoming key limitations of conventional ensembles. Extensive experiments show that the framework consistently outperforms strong single-agent and ensemble baselines across diverse reasoning benchmarks, confirming its effectiveness and robustness.
Abstract
LLMs often underperform on complex reasoning tasks when relying on a single generation-and-selection pipeline. Inference-time ensemble methods can improve performance by sampling diverse reasoning paths or aggregating multiple candidate answers, but they typically treat candidates independently and provide no formal guarantees that ensembling improves reasoning quality. We propose a novel method, Aligned Delegation for Multi-Agent LLM Reasoning (ALIGN), which formulates LLM reasoning as an aligned delegation game. In ALIGN, a principal delegates a task to multiple agents that generate candidate solutions under designed incentives, and then selects among their outputs to produce a final answer. This formulation induces structured interaction among agents while preserving alignment between agent and principal objectives. We establish theoretical guarantees showing that, under a fair comparison with equal access to candidate solutions, ALIGN provably improves expected performance over single-agent generation. Our analysis accommodates correlated candidate answers and relaxes independence assumptions that are commonly used in prior work. Empirical results across a broad range of LLM reasoning benchmarks consistently demonstrate that ALIGN outperforms strong single-agent and ensemble baselines.
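The delegate-then-select loop described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the paper's method: the agent functions are toy stand-ins for LLM samplers, and the majority-vote rule is one hypothetical choice of the principal's selection step (ALIGN's incentive mechanism and selection procedure are not specified here).

```python
from collections import Counter


def delegate_and_select(agents, task, select):
    """Principal delegates `task` to every agent, pools the candidate
    solutions they return, then applies a selection rule to pick the
    final answer. Names and signatures are illustrative assumptions."""
    candidates = [agent(task) for agent in agents]
    return select(candidates)


def majority_vote(candidates):
    # One simple selection rule: return the most frequent candidate.
    # Note this implicitly assumes errors are not perfectly correlated;
    # the paper's analysis explicitly handles correlated candidates.
    return Counter(candidates).most_common(1)[0][0]


# Toy agents answering the task "2+2": two are correct, one is not.
agents = [lambda t: 4, lambda t: 4, lambda t: 5]
print(delegate_and_select(agents, "2+2", majority_vote))  # prints 4
```

In this toy setting the principal recovers the correct answer even though one agent errs, which is the intuition behind the paper's claim that selection over multiple candidates can provably beat single-agent generation under a fair comparison.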