🤖 AI Summary
Existing Graph-CoT methods for knowledge graph reasoning suffer from low accuracy, excessive token consumption, high latency, and low throughput. These problems stem from monolithic agent prompting, redundant context encoding, and inefficient inference serving. This paper introduces GLM, the first multi-agent collaborative Graph-CoT framework, which decouples reasoning into four specialized agents: classification, graph retrieval, action generation, and logical reasoning. It further proposes graph-structure-aware KV-cache management with priority-based eviction, selective context sharing across agents, and pipelined parallel execution. Experiments demonstrate that the approach achieves up to 38% higher accuracy, reduces token consumption by 95.7%, cuts inference latency by 90.3%, and improves throughput by 15.1× over state-of-the-art methods. These advances significantly enhance the efficiency, scalability, and practical deployability of complex graph reasoning systems.
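The paper's code is not reproduced here; as a rough illustration of the four-agent decomposition, the minimal Python sketch below shows how such a loop could be wired. Every name in it (`GraphStore`, `call_llm`, the agent prompts, the stopping condition) is a hypothetical placeholder, not GLM's actual implementation.

```python
# Hypothetical sketch of a four-agent Graph-CoT loop; the agent prompts and
# the call_llm backend are placeholders, not GLM's actual implementation.
from dataclasses import dataclass, field

@dataclass
class GraphStore:
    """Toy knowledge graph: node -> list of (relation, neighbor)."""
    edges: dict = field(default_factory=dict)

    def neighbors(self, node: str):
        return self.edges.get(node, [])

def call_llm(role: str, prompt: str) -> str:
    """Placeholder for a per-agent LLM call: each agent gets its own short,
    specialized prompt instead of one monolithic context."""
    return f"[{role}] response to: {prompt[:40]}..."

def graph_cot(question: str, graph: GraphStore, max_steps: int = 4) -> str:
    # Agent 1: classification picks a reasoning strategy for the question.
    strategy = call_llm("classifier", question)
    context = []                      # selectively shared context, not full history
    frontier = "seed_node"
    for _ in range(max_steps):
        # Agent 2: graph retrieval fetches only the local neighborhood.
        hops = graph.neighbors(frontier)
        # Agent 3: action generation decides the next traversal step.
        action = call_llm("action", f"{strategy} | at {frontier} | edges {hops}")
        context.append(action)
        # Agent 4: logical reasoning checks whether we can answer yet,
        # seeing only the most recent steps (selective context sharing).
        verdict = call_llm("reasoner", " ; ".join(context[-2:]))
        if "final" in verdict:        # toy stopping condition
            return verdict
        frontier = hops[0][1] if hops else frontier
    return call_llm("reasoner", " ; ".join(context))

graph = GraphStore({"seed_node": [("cites", "paper_42")]})
print(graph_cot("Which paper does the seed cite?", graph))
```

The point of the decomposition is that each agent sees a short, role-specific prompt plus a selectively shared slice of context, rather than re-encoding the full monolithic history at every step.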
📝 Abstract
Graph Chain-of-Thought (Graph-CoT) enables large language models (LLMs) to perform step-by-step reasoning over graph-structured knowledge, but existing pipelines suffer from low accuracy, excessive token usage, high latency, and low throughput due to single-agent monolithic prompts, repeated context re-encoding, and inefficient serving. We present GLM, the first multi-agent Graph-CoT system co-designed with an optimized LLM serving architecture. GLM decomposes reasoning into specialized agents for classification, reasoning, action generation, and graph retrieval; branching and selective context sharing between agents shorten prompts and cut reasoning iterations while preserving reasoning quality, improving accuracy and reducing overall token consumption. To scale inference, we introduce a Graph-CoT-aware LLM serving mechanism with graph-specific KV-cache management, priority-based eviction, and pipelined execution. Experiments demonstrate that GLM improves answer accuracy by up to 38%, reduces token cost by up to 95.7%, lowers inference latency by 90.3%, and achieves up to 15.1× higher throughput compared to state-of-the-art Graph-CoT baselines, enabling efficient deployment for complex real-world reasoning at scale.
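As a rough sketch of what priority-based, graph-aware KV-cache eviction could look like, the toy class below scores cached per-node prefixes by node degree (hub nodes are revisited often during traversal) with recency as a tie-breaker. The scoring function and the `GraphKVCache` API are illustrative assumptions, not the mechanism reported in the paper.

```python
# Hypothetical sketch of graph-aware, priority-based KV-cache eviction.
# Scoring by node degree plus recency is an illustrative assumption; the
# paper's actual priority function may differ.
import itertools

class GraphKVCache:
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.entries = {}            # node_id -> (priority, kv_blob)
        self.counter = itertools.count()

    def _priority(self, degree: int) -> float:
        # Hub nodes are revisited often during graph traversal, so their
        # prefix KV blocks are worth keeping; recency breaks ties.
        return degree + 0.001 * next(self.counter)

    def put(self, node_id: str, kv_blob: bytes, degree: int) -> None:
        self.entries[node_id] = (self._priority(degree), kv_blob)
        while len(self.entries) > self.capacity:
            victim = min(self.entries, key=lambda n: self.entries[n][0])
            del self.entries[victim]  # evict the lowest-priority cached prefix

    def get(self, node_id: str):
        hit = self.entries.get(node_id)
        return hit[1] if hit else None

cache = GraphKVCache(capacity=2)
cache.put("hub", b"kv-hub", degree=50)
cache.put("leaf_a", b"kv-a", degree=1)
cache.put("leaf_b", b"kv-b", degree=1)   # evicts the older leaf, keeps the hub
print(cache.get("hub") is not None, cache.get("leaf_a") is None)
```

Under this (assumed) policy, prefixes for structurally central nodes survive cache pressure, so repeated traversals through hubs hit the cache instead of re-encoding their context.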