🤖 AI Summary
Multi-agent reasoning systems suffer significant performance degradation on complex tasks and in long-context settings, and their communication mechanisms lack a rigorous theoretical characterization of expressive power.
Method: We propose the first theoretical framework for analyzing the expressiveness of multi-agent reasoning, formally modeling three fundamental algorithmic paradigms—state tracking, memory retrieval, and k-hop reasoning—and establishing precise trade-offs among problem scale, number of agents, and communication bandwidth.
Contribution/Results: Through formal complexity analysis and experiments with pretrained LLMs on controlled synthetic benchmarks, we quantify both the benefits and limitations of communication: we prove that communication can exponentially reduce the required number of agents, while also deriving tight lower bounds on both agent count and total communication volume. Empirical results validate the predicted quantitative trade-offs among key variables, providing a foundational theoretical basis for the design and optimization of multi-agent reasoning systems.
📝 Abstract
Chain-of-thought prompting has popularized step-by-step reasoning in large language models, yet model performance still degrades as problem complexity and context length grow. By decomposing difficult tasks with long contexts into shorter, manageable ones, recent multi-agent paradigms offer a promising near-term solution to this problem. However, the fundamental capacities of such systems are poorly understood. In this work, we propose a theoretical framework to analyze the expressivity of multi-agent systems. We apply our framework to three algorithmic families: state tracking, recall, and $k$-hop reasoning. We derive bounds on (i) the number of agents required to solve the task exactly, (ii) the quantity and structure of inter-agent communication, and (iii) the achievable speedups as problem size and context scale. Our results identify regimes where communication is provably beneficial, delineate tradeoffs between agent count and bandwidth, and expose intrinsic limitations when either resource is constrained. We complement our theoretical analysis with a set of experiments on pretrained LLMs using controlled synthetic benchmarks. Empirical outcomes confirm the tradeoffs between key quantities predicted by our theory. Collectively, our analysis offers principled guidance for designing scalable multi-agent reasoning systems.
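To make the $k$-hop reasoning family concrete, here is a minimal sketch of how a controlled synthetic benchmark of this kind could be generated: a random function over a finite domain is given as explicit facts, and the query requires chaining $k$ lookups. All names (`make_khop_task`, the `f(i) = v` fact format) are hypothetical illustrations, not the paper's actual benchmark construction.

```python
import random

def make_khop_task(k, domain_size, seed=0):
    """Generate one synthetic k-hop instance: a random successor
    table presented as facts, a query needing k chained lookups,
    and the ground-truth answer. (Hypothetical construction; the
    paper's exact benchmarks may differ.)"""
    rng = random.Random(seed)
    # A random function f over {0, ..., domain_size - 1}.
    table = {i: rng.randrange(domain_size) for i in range(domain_size)}
    start = rng.randrange(domain_size)
    facts = [f"f({i}) = {v}" for i, v in table.items()]
    query = f"What is f^{k}({start})?"
    # Ground truth: apply f exactly k times.
    answer = start
    for _ in range(k):
        answer = table[answer]
    return facts, query, answer

facts, query, answer = make_khop_task(k=3, domain_size=8, seed=1)
```

A multi-agent split of such a task might hand each agent a slice of the facts and require communication to resolve intermediate hops, which is exactly the agent-count vs. bandwidth trade-off the framework analyzes.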