🤖 AI Summary
Existing prompting paradigms such as Chain-of-Thought (CoT) and Tree-of-Thought (ToT) are constrained to linear or tree-shaped structures and struggle with complex reasoning tasks that require merging intermediate results, backtracking over hypotheses, or synthesizing evidence from multiple sources. This work proposes the Network-of-Thought (NoT) framework, which models reasoning as a directed graph with typed nodes and edges and introduces an LLM-generated heuristic controller to guide graph-based search. NoT supports multi-path merging, backtracking, and evidence integration, achieving the highest GSM8K accuracy among the evaluated methods (91.5%) with 72B open-source models and the best overall multi-hop QA result (91.7% on HotpotQA with Qwen2.5-72B). On ProofWriter, self-generated controller heuristics significantly outperform fixed or random strategies (57.0% accuracy), demonstrating the value of combining graph-structured reasoning with self-generated heuristics.
📝 Abstract
Existing prompting paradigms structure LLM reasoning in limited topologies: Chain-of-Thought (CoT) produces linear traces, while Tree-of-Thought (ToT) performs branching search. Yet complex reasoning often requires merging intermediate results, revisiting hypotheses, and integrating evidence from multiple sources. We propose Network-of-Thought (NoT), a framework that models reasoning as a directed graph with typed nodes and edges, guided by a heuristic-based controller policy. Across four benchmarks (GSM8K, Game of 24, HotpotQA, ProofWriter) and three models (GPT-4o-mini, Llama-3.3-70B-Instruct, Qwen2.5-72B-Instruct), we investigate when network topology outperforms chain or tree structures, whether LLM-generated heuristics can guide graph-based reasoning search, and the computation-accuracy tradeoff across topologies, evaluating each method on accuracy, topology simplicity, and token efficiency. Our results show that CoT remains effective for sequential tasks with GPT-4o-mini (89.5% on GSM8K), while NoT surpasses ToT on multi-hop reasoning (91.0% vs. 88.0% on HotpotQA with LLM-as-Judge). With 72B open-source models, NoT achieves the highest accuracy on GSM8K (91.5%), and Qwen2.5-72B achieves the best multi-hop QA result overall (91.7% on HotpotQA). Self-generated controller heuristics outperform fixed and random strategies on logical reasoning, with uncertainty-only weighting achieving 57.0% on ProofWriter. We also find that evaluation methodology significantly impacts method rankings: string-match underestimates all methods on open-ended QA, with the largest gap for NoT, a pattern consistent across all three models (14–18 percentage point gap on HotpotQA).
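To make the core ideas concrete, here is a minimal sketch of a reasoning graph with typed nodes and an uncertainty-only controller heuristic, as the abstract describes. The specific node/edge types, the `uncertainty` field, and all names below are illustrative assumptions for exposition, not the paper's actual implementation (which would call an LLM to expand, merge, and score nodes).

```python
from dataclasses import dataclass, field

@dataclass
class ThoughtNode:
    """One intermediate reasoning step in the graph."""
    node_id: int
    node_type: str      # assumed type set, e.g. "hypothesis", "evidence", "merge"
    content: str        # the thought text an LLM would produce
    uncertainty: float  # assumed self-reported score in [0, 1]; lower = more confident

@dataclass
class ThoughtGraph:
    """Directed graph of typed thoughts; edges are also typed."""
    nodes: dict = field(default_factory=dict)
    edges: list = field(default_factory=list)  # (src_id, dst_id, edge_type)

    def add_node(self, node: ThoughtNode) -> None:
        self.nodes[node.node_id] = node

    def add_edge(self, src: int, dst: int, edge_type: str) -> None:
        self.edges.append((src, dst, edge_type))

def select_frontier_node(graph: ThoughtGraph, frontier: list) -> int:
    """Uncertainty-only weighting: expand the frontier node the model is
    most confident about (lowest uncertainty), rather than a fixed or
    random order."""
    return min(frontier, key=lambda i: graph.nodes[i].uncertainty)

# Tiny demo: evidence that supports a hypothesis is scored as more certain,
# so the controller expands it first.
g = ThoughtGraph()
g.add_node(ThoughtNode(0, "hypothesis", "Assume x is even", uncertainty=0.6))
g.add_node(ThoughtNode(1, "evidence", "Rule 3 applies when x is even", uncertainty=0.2))
g.add_node(ThoughtNode(2, "hypothesis", "Assume x is odd", uncertainty=0.8))
g.add_edge(1, 0, "supports")  # evidence node supports hypothesis node

print(select_frontier_node(g, [0, 1, 2]))  # → 1 (lowest uncertainty)
```

Because every node stays addressable in the graph, the controller can later revisit node 0 or 2 (backtracking) or add a `merge` node with incoming edges from several branches (multi-path merging), which chain and tree topologies cannot express.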