Codebase-Memory: Tree-Sitter-Based Knowledge Graphs for LLM Code Exploration via MCP

📅 2026-03-28
🤖 AI Summary
Current large language model (LLM) code agents rely heavily on repeated file reads and grep-based searches, which consume many tokens and yield no structured understanding of the program. This work presents Codebase-Memory, an open-source system that builds a persistent, cross-language code knowledge graph via the Model Context Protocol (MCP). Its multi-phase pipeline combines Tree-Sitter syntactic parsing, call-graph traversal, impact analysis, and community detection, and supports 66 programming languages. Evaluated on 31 real-world repositories, the approach reaches 83% answer quality, compared with 92% for a file-exploration agent, while using roughly one-tenth the tokens and 2.1 times fewer tool calls. On graph-native queries such as hub detection and caller ranking, it matches or exceeds the file-exploration agent on 19 of 31 languages.
📝 Abstract
Large Language Model (LLM) coding agents typically explore codebases through repeated file-reading and grep-searching, consuming thousands of tokens per query without structural understanding. We present Codebase-Memory, an open-source system that constructs a persistent, Tree-Sitter-based knowledge graph via the Model Context Protocol (MCP), parsing 66 languages through a multi-phase pipeline with parallel worker pools, call-graph traversal, impact analysis, and community discovery. Evaluated across 31 real-world repositories, Codebase-Memory achieves 83% answer quality versus 92% for a file-exploration agent, at ten times fewer tokens and 2.1 times fewer tool calls. For graph-native queries such as hub detection and caller ranking, it matches or exceeds the explorer on 19 of 31 languages.
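To make the graph-native queries named in the abstract more concrete, the following is a minimal sketch of impact analysis and caller ranking over a call graph. It is not the Codebase-Memory implementation: the edge list is hand-written rather than extracted with Tree-Sitter, and the names `CALLS`, `build_graph`, `impact`, and `rank_by_callers` are hypothetical, chosen only for illustration.

```python
from collections import defaultdict, deque

# Hypothetical (caller, callee) edges. In a real system these would come
# from a parser; here they are written by hand for illustration.
CALLS = [
    ("main", "load_config"),
    ("main", "run"),
    ("run", "parse"),
    ("run", "emit"),
    ("parse", "tokenize"),
    ("emit", "tokenize"),
]


def build_graph(edges):
    """Index edges in both directions: callees by caller, callers by callee."""
    callees = defaultdict(set)
    callers = defaultdict(set)
    for src, dst in edges:
        callees[src].add(dst)
        callers[dst].add(src)
    return callees, callers


def impact(callers, target):
    """Return every function that calls `target` directly or transitively."""
    seen = set()
    queue = deque([target])
    while queue:
        node = queue.popleft()
        for caller in callers.get(node, ()):
            if caller not in seen:
                seen.add(caller)
                queue.append(caller)
    return seen


def rank_by_callers(callers):
    """Rank functions by number of direct callers (a simple hub measure)."""
    counts = [(name, len(who)) for name, who in callers.items()]
    return sorted(counts, key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    _, callers = build_graph(CALLS)
    print(sorted(impact(callers, "tokenize")))
    print(rank_by_callers(callers)[0])
```

Once the graph is persisted, each of these questions is a single lookup or traversal rather than a series of file reads and grep calls, which is the kind of saving in tokens and tool calls that the abstract reports.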
Problem

Research questions and friction points this paper is trying to address.

codebase exploration
Large Language Model
structural understanding
token efficiency
knowledge graph
Innovation

Methods, ideas, or system contributions that make the work stand out.

Codebase-Memory
Tree-Sitter
Knowledge Graph
Model Context Protocol
LLM Code Exploration
Martin Vogel
Independent Researcher, Berlin, Germany
Falk Meyer-Eschenbach
Institute of Medical Informatics, Charité – Universitätsmedizin Berlin, Berlin, Germany
Severin Kohler
Institute of Informatics, Freie Universität Berlin, Berlin, Germany
Elias Grünewald
PostDoc, Charité Berlin
Medical Data Science, Privacy Engineering, Information Systems, Cloud Computing
Felix Balzer
Institute of Medical Informatics, Charité – Universitätsmedizin Berlin, Berlin, Germany