🤖 AI Summary
Current large language model (LLM) code agents explore codebases through repetitive file reads and grep-based searches, incurring substantial token costs without structured program understanding. This work proposes a multi-stage pipeline grounded in the Model Context Protocol (MCP), which combines Tree-Sitter syntactic parsing, call-graph traversal, impact analysis, and community detection to construct a persistent, cross-language code knowledge graph covering 66 programming languages. Evaluated on 31 real-world repositories, the approach achieves 83% answer quality (versus 92% for a file-exploration baseline) while using one-tenth the tokens and 2.1× fewer tool calls. On graph-native query tasks such as hub detection and caller ranking, it matches or outperforms the baseline on 19 of the 31 evaluated languages.
📝 Abstract
Large Language Model (LLM) coding agents typically explore codebases through repeated file-reading and grep-searching, consuming thousands of tokens per query without structural understanding. We present Codebase-Memory, an open-source system that constructs a persistent, Tree-Sitter-based knowledge graph via the Model Context Protocol (MCP), parsing 66 languages through a multi-phase pipeline with parallel worker pools, call-graph traversal, impact analysis, and community detection. Evaluated across 31 real-world repositories, Codebase-Memory achieves 83% answer quality versus 92% for a file-exploration agent, with ten times fewer tokens and 2.1 times fewer tool calls. For graph-native queries such as hub detection and caller ranking, it matches or exceeds the explorer on 19 of 31 languages.
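To make the call-graph-traversal and impact-analysis steps concrete, the sketch below builds a toy call graph and answers a graph-native query ("who is transitively affected if this function changes?"). This is an illustration only, not the paper's implementation: the actual system uses Tree-Sitter to parse 66 languages, whereas here Python's stdlib `ast` module stands in as a single-language parser, and the helper names (`build_call_graph`, `impacted_by`) are hypothetical.

```python
# Illustrative sketch of call-graph construction + impact analysis.
# NOT the paper's code: `ast` replaces Tree-Sitter, and names are invented.
import ast
from collections import defaultdict, deque

SOURCE = """
def parse(src): return tokenize(src)
def tokenize(src): return src.split()
def analyze(tree): return parse(tree)
def report(tree): return analyze(tree)
"""

def build_call_graph(source):
    """Map each function to the set of functions it calls directly."""
    tree = ast.parse(source)
    graph = defaultdict(set)
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef):
            for sub in ast.walk(node):
                # Only simple-name calls; attribute calls (e.g. src.split) are skipped.
                if isinstance(sub, ast.Call) and isinstance(sub.func, ast.Name):
                    graph[node.name].add(sub.func.id)
    return graph

def impacted_by(graph, target):
    """Impact analysis: all functions that transitively call `target`."""
    # Reverse the call edges, then BFS outward from the changed function.
    reverse = defaultdict(set)
    for caller, callees in graph.items():
        for callee in callees:
            reverse[callee].add(caller)
    seen, queue = set(), deque([target])
    while queue:
        for caller in reverse[queue.popleft()]:
            if caller not in seen:
                seen.add(caller)
                queue.append(caller)
    return seen

graph = build_call_graph(SOURCE)
print(impacted_by(graph, "tokenize"))  # → {'parse', 'analyze', 'report'}
```

A persistent version of such a graph, queried over MCP, is what lets the agent answer caller-ranking or blast-radius questions in one tool call instead of many grep rounds.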