AI Summary
In multi-hop question answering (MHQA), large language models (LLMs) suffer from limited single-turn output capacity, hindering reliable integration of dispersed, interdependent evidence under noisy conditions and thereby degrading single-step reasoning accuracy. To address this, we first derive a capacity-aware theoretical accuracy upper bound grounded in Fano's inequality, formally exposing the fundamental tension between task complexity and model capacity. Building on this insight, we propose InfoQA, a multi-call reasoning framework that decomposes tasks, explicitly models inter-evidence dependencies, and actively prunes reasoning trajectories to ensure stability and robustness in high-noise settings. Experiments demonstrate that the theoretically derived capacity curve closely aligns with empirical performance; InfoQA achieves significant accuracy gains across multiple high-noise MHQA benchmarks, while exhibiting strong robustness and scalability.
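For intuition, the standard form of Fano's inequality already yields an accuracy ceiling of this kind. The sketch below states it for an answer variable uniform over M candidates, with natural logarithms; the paper's capacity-aware bound may refine this generic form, and the symbols here are the textbook ones, not necessarily the paper's notation.

```latex
% Fano's inequality: answer X uniform over M candidates, model output Y,
% estimator \hat{X}(Y), error probability P_e = \Pr[\hat{X}(Y) \neq X]:
H(X \mid Y) \le H_b(P_e) + P_e \log(M - 1)

% With H(X) = \log M, H(X \mid Y) = H(X) - I(X;Y), and H_b(P_e) \le \log 2,
% rearranging gives an upper bound on accuracy:
1 - P_e \le \frac{I(X;Y) + \log 2}{\log M}
```

Read this way, I(X;Y) plays the role of the model's per-pass capacity: once the task's effective log M outgrows it, the right-hand side drops below 1 and accuracy must collapse.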
Abstract
Multi-Hop Question Answering (MHQA) requires integrating dispersed, interdependent evidence through sequential reasoning under noise. This task is challenging for LLMs, which have a finite per-pass output capacity beyond which the integration of task-relevant evidence becomes unreliable. Consequently, the single-pass reasoning paradigm is inherently vulnerable to this capacity overflow. To formalize this bottleneck, our analysis establishes a Fano-style accuracy upper bound, defining a theoretical performance ceiling for single-pass LLMs. This bound reveals that accuracy inevitably collapses once task complexity exceeds model capacity, providing general principles for capacity-aware representation and structuring of MHQA in LLMs. Building on these principles, we introduce a proof-of-concept multi-call framework for MHQA, InfoQA. It ensures high per-step accuracy by combining capacity-aware task decomposition with active pruning of prior reasoning traces, keeping the information load within the single-pass limit. It further achieves robustness through a dependency-explicit workflow that enables precise control over the reasoning path. We construct a stringent and noise-rich benchmark to validate our theory and framework. Experimental results show that model behavior aligns with our predicted capacity curves while InfoQA achieves consistent performance improvements. We hope our work inspires more LLM multi-step reasoning methods. Code: https://github.com/KaiyangWan/InfoQA
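The decompose-then-prune workflow can be illustrated with a minimal sketch. This is not the InfoQA implementation: `call_llm` is a hard-coded stub standing in for a real model call, and the two-hop example question, the `depends_on` encoding, and all names here are hypothetical, chosen only to show the dependency-explicit, actively-pruned multi-call pattern the abstract describes.

```python
def call_llm(prompt: str) -> str:
    """Stub standing in for a real LLM call (hypothetical toy answers)."""
    table = {
        "Who directed Film X?": "Director D",
        "Where was Director D born?": "City C",
    }
    for question, answer in table.items():
        if question in prompt:
            return answer
    return "unknown"


def multi_call_qa(subquestions, depends_on):
    """Answer sub-questions in dependency order with active pruning.

    subquestions: templates whose {a<i>} slots take the answer to
    sub-question i (dependency-explicit workflow).
    depends_on[j]: indices of the answers sub-question j needs; only
    those are kept in prompt j (active pruning of prior traces), so
    each call's information load stays within the single-pass limit.
    """
    answers = []
    for j, template in enumerate(subquestions):
        # Prune: carry forward only the answers this step depends on.
        kept = {f"a{i}": answers[i] for i in depends_on[j]}
        answers.append(call_llm(template.format(**kept)))
    return answers[-1]


final = multi_call_qa(
    ["Who directed Film X?", "Where was {a0} born?"],
    depends_on=[[], [0]],
)
print(final)  # → City C
```

Note that hop 2 never sees hop 1's question, only its answer; in a real high-noise setting this is where pruning pays off, since irrelevant retrieved passages are dropped rather than accumulated across calls.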