Scaling Graph Chain-of-Thought Reasoning: A Multi-Agent Framework with Efficient LLM Serving

📅 2025-11-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing Graph-CoT methods for knowledge graph reasoning suffer from low accuracy, excessive token consumption, high latency, and low throughput—stemming from monolithic agent prompting, redundant context encoding, and inefficient inference serving. This paper introduces the first multi-agent collaborative Graph-CoT framework, decoupling reasoning into four specialized agents: classification, graph retrieval, action generation, and logical reasoning. We further propose a graph-structure-aware KV cache management strategy, selective context sharing, priority-based cache eviction, and pipelined parallel execution. Experiments demonstrate that our approach achieves up to 38% higher accuracy, reduces token consumption by 95.7%, cuts inference latency by 90.3%, and improves throughput by 15.1× over state-of-the-art methods. These advances significantly enhance the efficiency, scalability, and practical deployability of complex graph reasoning systems.

📝 Abstract
Graph Chain-of-Thought (Graph-CoT) enables large language models (LLMs) to perform step-by-step reasoning over graph-structured knowledge, but existing pipelines suffer from low accuracy, excessive token usage, high latency, and low throughput due to single-agent monolithic prompts, repeated context re-encoding, and inefficient serving execution. We present GLM, the first multi-agent Graph-CoT system co-designed with an optimized LLM serving architecture. GLM decomposes reasoning into specialized agents for classification, reasoning, action generation, and graph retrieval, enabling branching and selective context sharing to reduce prompt length and reasoning iterations while preserving reasoning quality, thereby improving accuracy and reducing overall token consumption. To scale inference, we introduce a Graph-CoT-aware LLM inference mechanism with graph-specific KV-cache management, priority-based eviction, and pipelined execution to improve serving efficiency. Experiments demonstrate that GLM improves answer accuracy by up to 38%, reduces token cost by up to 95.7%, lowers inference latency by 90.3%, and achieves up to 15.1x higher throughput compared to state-of-the-art Graph-CoT baselines, enabling efficient adoption for complex real-world reasoning at scale.
Problem

Research questions and friction points this paper is trying to address.

Low accuracy and excessive token usage in graph reasoning
High latency and low throughput in LLM inference serving
Lack of scalable multi-agent reasoning over graph-structured knowledge
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-agent framework decomposes reasoning into specialized agents
Graph-CoT-aware LLM inference with graph-specific KV-cache management
Priority-based eviction and pipelined execution improve serving efficiency
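The priority-based eviction idea can be illustrated with a toy KV-cache keyed by graph node: entries whose nodes are least likely to be revisited (lowest priority) are evicted first. The `PriorityKVCache` class and its priority scores are illustrative assumptions, not GLM's actual serving code, which manages KV blocks inside the inference engine.

```python
import heapq

class PriorityKVCache:
    """Toy KV-cache with priority-based eviction over graph-node entries."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.entries = {}   # node_id -> cached KV blob
        self.heap = []      # min-heap of (priority, node_id); lowest evicted first

    def put(self, node_id: str, kv, priority: float) -> None:
        # Evict lowest-priority nodes until there is room for the new entry.
        while len(self.entries) >= self.capacity and self.heap:
            _, victim = heapq.heappop(self.heap)
            self.entries.pop(victim, None)
        self.entries[node_id] = kv
        heapq.heappush(self.heap, (priority, node_id))

    def get(self, node_id: str):
        return self.entries.get(node_id)
```

In a graph-structure-aware setting, the priority score would reflect graph locality, e.g. nodes adjacent to the current reasoning frontier score higher than distant ones, so their KV states survive eviction and avoid re-encoding.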
Chengying Huan
Nanjing University
Ziheng Meng
Nanjing University
Yongchao Liu
Ant Group
Zhengyi Yang
University of New South Wales
Yun Zhu
Shanghai Artificial Intelligence Laboratory
Yue Yun
Ant Group
Shipeng Li
Nanjing University
Rong Gu
Mälardalen University
Formal Methods, Machine Learning, Autonomous Systems
Xiabao Wu
Ant Group
Haitao Zhang
Ant Group
Chuntao Hong
Ant Group
Shaonan Ma
Tsinghua University
Guihai Chen
Professor of Computer Science, Computer Science and Technology
Chen Tian
Nanjing University
Data Center Networking, Network Function Virtualisation, Content Distribution