🤖 AI Summary
Precisely locating relevant code units (files/classes/functions) in large-scale codebases given user queries, bug reports, or feature requests remains challenging. Existing dense retrieval methods neglect the inherent graph-structured relationships among code elements and lack effective contextual exploration.
Method: This paper proposes a spatially aware dense retrieval framework that integrates structured auxiliary context—generated via graph traversal—into the dense retrieval pipeline for the first time. It enhances semantic matching through LLM-driven spatial reasoning and comprises four components: code graph sampling, LLM-assisted contextualization, multilingual embedding fine-tuning, and re-ranking.
Contribution/Results: Evaluated on multilingual codebases, our method significantly outperforms BM25 and standard dense retrieval baselines, achieving an average 18.7% improvement in Recall@10. Results empirically validate the effectiveness of jointly leveraging graph-structural guidance and LLM-based spatial reasoning for code semantic retrieval.
📝 Abstract
Retrieving code units (e.g., files, classes, functions) that are semantically relevant to a given user query, bug report, or feature request from large codebases is a fundamental challenge for LLM-based coding agents. Agentic approaches typically employ sparse retrieval methods like BM25 or dense embedding strategies to identify relevant units. While embedding-based approaches can outperform BM25 by large margins, they often lack exploration of the codebase and underutilize its underlying graph structure. To address this, we propose SpIDER (Spatially Informed Dense Embedding Retrieval), an enhanced dense retrieval approach that incorporates LLM-based reasoning over auxiliary context obtained through graph-based exploration of the codebase. Empirical results show that SpIDER consistently improves dense retrieval performance across several programming languages.