🤖 AI Summary
Existing Graph-CoT methods for knowledge graph reasoning suffer from low accuracy, excessive token consumption, high latency, and low throughput. These problems stem from monolithic agent prompting, redundant context encoding, and inefficient inference serving. This paper introduces GLM, the first multi-agent collaborative Graph-CoT framework, which decouples reasoning into four specialized agents: classification, graph retrieval, action generation, and logical reasoning. It further proposes graph-structure-aware KV-cache management with priority-based eviction, selective context sharing across agents, and pipelined parallel execution. Experiments demonstrate that the approach achieves up to 38% higher accuracy, reduces token consumption by 95.7%, cuts inference latency by 90.3%, and improves throughput by 15.1× over state-of-the-art methods. These advances significantly enhance the efficiency, scalability, and practical deployability of complex graph reasoning systems.
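The paper's code is not reproduced here; as a rough illustration of the four-agent decomposition, the minimal Python sketch below shows how such a loop could be wired. Every name in it (`GraphStore`, `call_llm`, the agent prompts, the stopping condition) is a hypothetical placeholder, not GLM's actual implementation.

```python
# Hypothetical sketch of a four-agent Graph-CoT loop; the agent prompts and
# the call_llm backend are placeholders, not GLM's actual implementation.
from dataclasses import dataclass, field

@dataclass
class GraphStore:
    """Toy knowledge graph: node -> list of (relation, neighbor)."""
    edges: dict = field(default_factory=dict)

    def neighbors(self, node: str):
        return self.edges.get(node, [])

def call_llm(role: str, prompt: str) -> str:
    """Placeholder for a per-agent LLM call: each agent gets its own short,
    specialized prompt instead of one monolithic context."""
    return f"[{role}] response to: {prompt[:40]}..."

def graph_cot(question: str, graph: GraphStore, max_steps: int = 4) -> str:
    # Agent 1: classification picks a reasoning strategy for the question.
    strategy = call_llm("classifier", question)
    context = []                      # selectively shared context, not full history
    frontier = "seed_node"
    for _ in range(max_steps):
        # Agent 2: graph retrieval fetches only the local neighborhood.
        hops = graph.neighbors(frontier)
        # Agent 3: action generation decides the next traversal step.
        action = call_llm("action", f"{strategy} | at {frontier} | edges {hops}")
        context.append(action)
        # Agent 4: logical reasoning checks whether we can answer yet,
        # seeing only the most recent steps (selective context sharing).
        verdict = call_llm("reasoner", " ; ".join(context[-2:]))
        if "final" in verdict:        # toy stopping condition
            return verdict
        frontier = hops[0][1] if hops else frontier
    return call_llm("reasoner", " ; ".join(context))

graph = GraphStore({"seed_node": [("cites", "paper_42")]})
print(graph_cot("Which paper does the seed cite?", graph))
```

The point of the decomposition is that each agent sees a short, role-specific prompt plus a selectively shared slice of context, rather than re-encoding the full monolithic history at every step.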
📝 Abstract
Graph Chain-of-Thought (Graph-CoT) enables large language models (LLMs) to perform step-by-step reasoning over graph-structured knowledge, but existing pipelines suffer from low accuracy, excessive token usage, high latency, and low throughput due to single-agent monolithic prompts, repeated context re-encoding, and inefficient serving. We present GLM, the first multi-agent Graph-CoT system co-designed with an optimized LLM serving architecture. GLM decomposes reasoning into specialized agents for classification, reasoning, action generation, and graph retrieval; branching and selective context sharing between agents shorten prompts and cut reasoning iterations while preserving reasoning quality, improving accuracy and reducing overall token consumption. To scale inference, we introduce a Graph-CoT-aware LLM serving mechanism with graph-specific KV-cache management, priority-based eviction, and pipelined execution. Experiments demonstrate that GLM improves answer accuracy by up to 38%, reduces token cost by up to 95.7%, lowers inference latency by 90.3%, and achieves up to 15.1× higher throughput compared to state-of-the-art Graph-CoT baselines, enabling efficient deployment for complex real-world reasoning at scale.
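As a rough sketch of what priority-based, graph-aware KV-cache eviction could look like, the toy class below scores cached per-node prefixes by node degree (hub nodes are revisited often during traversal) with recency as a tie-breaker. The scoring function and the `GraphKVCache` API are illustrative assumptions, not the mechanism reported in the paper.

```python
# Hypothetical sketch of graph-aware, priority-based KV-cache eviction.
# Scoring by node degree plus recency is an illustrative assumption; the
# paper's actual priority function may differ.
import itertools

class GraphKVCache:
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.entries = {}            # node_id -> (priority, kv_blob)
        self.counter = itertools.count()

    def _priority(self, degree: int) -> float:
        # Hub nodes are revisited often during graph traversal, so their
        # prefix KV blocks are worth keeping; recency breaks ties.
        return degree + 0.001 * next(self.counter)

    def put(self, node_id: str, kv_blob: bytes, degree: int) -> None:
        self.entries[node_id] = (self._priority(degree), kv_blob)
        while len(self.entries) > self.capacity:
            victim = min(self.entries, key=lambda n: self.entries[n][0])
            del self.entries[victim]  # evict the lowest-priority cached prefix

    def get(self, node_id: str):
        hit = self.entries.get(node_id)
        return hit[1] if hit else None

cache = GraphKVCache(capacity=2)
cache.put("hub", b"kv-hub", degree=50)
cache.put("leaf_a", b"kv-a", degree=1)
cache.put("leaf_b", b"kv-b", degree=1)   # evicts the older leaf, keeps the hub
print(cache.get("hub") is not None, cache.get("leaf_a") is None)
```

Under this (assumed) policy, prefixes for structurally central nodes survive cache pressure, so repeated traversals through hubs hit the cache instead of re-encoding their context.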