🤖 AI Summary
Current large language model (LLM) code agents explore codebases through repetitive file reads and grep-based searches, incurring substantial token costs without structured program understanding. This work proposes a multi-stage pipeline grounded in the Model Context Protocol (MCP), which combines Tree-Sitter syntactic parsing, call-graph traversal, impact analysis, and community detection to construct a persistent, cross-language code knowledge graph covering 66 programming languages. Evaluated on 31 real-world repositories, the approach achieves 83% answer quality (versus 92% for a file-exploration baseline) while using one-tenth the tokens and 2.1× fewer tool calls. On graph-native query tasks such as hub detection and caller ranking, it matches or outperforms the baseline on 19 of the 31 evaluated languages.
📝 Abstract
Large Language Model (LLM) coding agents typically explore codebases through repeated file-reading and grep-searching, consuming thousands of tokens per query without structural understanding. We present Codebase-Memory, an open-source system that constructs a persistent, Tree-Sitter-based knowledge graph via the Model Context Protocol (MCP), parsing 66 languages through a multi-phase pipeline with parallel worker pools, call-graph traversal, impact analysis, and community detection. Evaluated across 31 real-world repositories, Codebase-Memory achieves 83% answer quality versus 92% for a file-exploration agent, with ten times fewer tokens and 2.1 times fewer tool calls. For graph-native queries such as hub detection and caller ranking, it matches or exceeds the explorer on 19 of 31 languages.
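To make the call-graph-traversal and impact-analysis steps concrete, the sketch below builds a toy call graph and answers a graph-native query ("who is transitively affected if this function changes?"). This is an illustration only, not the paper's implementation: the actual system uses Tree-Sitter to parse 66 languages, whereas here Python's stdlib `ast` module stands in as a single-language parser, and the helper names (`build_call_graph`, `impacted_by`) are hypothetical.

```python
# Illustrative sketch of call-graph construction + impact analysis.
# NOT the paper's code: `ast` replaces Tree-Sitter, and names are invented.
import ast
from collections import defaultdict, deque

SOURCE = """
def parse(src): return tokenize(src)
def tokenize(src): return src.split()
def analyze(tree): return parse(tree)
def report(tree): return analyze(tree)
"""

def build_call_graph(source):
    """Map each function to the set of functions it calls directly."""
    tree = ast.parse(source)
    graph = defaultdict(set)
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef):
            for sub in ast.walk(node):
                # Only simple-name calls; attribute calls (e.g. src.split) are skipped.
                if isinstance(sub, ast.Call) and isinstance(sub.func, ast.Name):
                    graph[node.name].add(sub.func.id)
    return graph

def impacted_by(graph, target):
    """Impact analysis: all functions that transitively call `target`."""
    # Reverse the call edges, then BFS outward from the changed function.
    reverse = defaultdict(set)
    for caller, callees in graph.items():
        for callee in callees:
            reverse[callee].add(caller)
    seen, queue = set(), deque([target])
    while queue:
        for caller in reverse[queue.popleft()]:
            if caller not in seen:
                seen.add(caller)
                queue.append(caller)
    return seen

graph = build_call_graph(SOURCE)
print(impacted_by(graph, "tokenize"))  # → {'parse', 'analyze', 'report'}
```

A persistent version of such a graph, queried over MCP, is what lets the agent answer caller-ranking or blast-radius questions in one tool call instead of many grep rounds.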