AI Summary
This work addresses the information loss in Chain-of-Agents frameworks caused by fixed or heuristic ordering of textual chunks. To mitigate this issue, the study introduces Chow-Liu trees into long-context multi-agent reasoning for the first time, using them to learn strong dependencies among text segments and to construct a probabilistic dependency structure. Building on this tree, the authors propose a breadth-first traversal strategy that dynamically determines the processing order of chunks, alleviating information bottlenecks during summary propagation. Experiments on three long-context benchmarks show that the proposed method significantly outperforms both default sequential ordering and semantic scoring-based baselines in answer relevance and exact-match accuracy.
Abstract
Sequential multi-agent reasoning frameworks such as Chain-of-Agents (CoA) handle long-context queries by decomposing inputs into chunks and processing them sequentially using LLM-based worker agents that read from and update a bounded shared memory. From a probabilistic perspective, CoA aims to approximate the conditional distribution corresponding to a model capable of jointly reasoning over the entire long context. CoA achieves this through a latent-state factorization in which only bounded summaries of previously processed evidence are passed between agents. The resulting bounded-memory approximation introduces a lossy information bottleneck, making the final evidence state inherently dependent on the order in which chunks are processed.
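The bounded-memory approximation can be made concrete with a short formalization (the notation below is ours, introduced for illustration rather than taken from the paper). Let $c_1, \dots, c_k$ denote the chunks and $s_i$ the bounded summary after worker agent $i$; each agent updates the shared memory from the previous summary and its chunk:

$$
p(y \mid c_1, \dots, c_k) \;\approx\; p(y \mid s_k), \qquad s_i = f_\theta(s_{i-1}, c_i),
$$

where every $s_i$ is constrained to a fixed budget. Because $f_\theta$ is lossy, $s_k$, and therefore the final answer, depends on the order in which the $c_i$ are consumed.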
In this work, we study the problem of chunk ordering for long-context reasoning. We leverage Chow-Liu trees to learn a dependency structure over chunks that prioritizes strongly related segments. Empirically, we show that a breadth-first traversal of the resulting tree yields chunk orderings that reduce information loss across agents and consistently outperform both default document-chunk ordering and semantic score-based ordering in answer relevance and exact-match accuracy across three long-context benchmarks.
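The ordering procedure can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the token-overlap score below stands in for the pairwise mutual-information estimates that a Chow-Liu tree is built from, and the choice of root chunk is left arbitrary.

```python
from collections import deque

def dependency_score(a, b):
    """Proxy for pairwise mutual information: normalized token overlap."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / max(1, len(ta | tb))

def chow_liu_order(chunks, root=0):
    """Build a Chow-Liu-style tree over chunks, then return a BFS order."""
    n = len(chunks)
    score = [[dependency_score(chunks[i], chunks[j]) for j in range(n)]
             for i in range(n)]
    # Prim's algorithm for a *maximum* spanning tree: the Chow-Liu tree
    # maximizes the total pairwise dependency (mutual information) retained.
    in_tree = {root}
    adj = {i: [] for i in range(n)}
    while len(in_tree) < n:
        i, j = max(((i, j) for i in in_tree
                    for j in range(n) if j not in in_tree),
                   key=lambda e: score[e[0]][e[1]])
        adj[i].append(j)
        adj[j].append(i)
        in_tree.add(j)
    # Breadth-first traversal of the tree gives the chunk processing order,
    # so strongly dependent chunks are visited close together.
    order, seen, q = [], {root}, deque([root])
    while q:
        u = q.popleft()
        order.append(u)
        for v in adj[u]:
            if v not in seen:
                seen.add(v)
                q.append(v)
    return order
```

In a real pipeline the scores would come from an estimator of mutual information between chunks (or a learned relevance model), and the resulting order would drive which chunk each worker agent receives next.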