SpIDER: Spatially Informed Dense Embedding Retrieval for Software Issue Localization

📅 2025-12-17

📈 Citations: 0

✨ Influential: 0

career value

161K/year

🤖 AI Summary

Precisely locating relevant code units (files/classes/functions) in large-scale codebases given user queries, bug reports, or feature requests remains challenging. Existing dense retrieval methods neglect the inherent graph-structured relationships among code elements and lack effective contextual exploration. Method: This paper proposes a spatially aware dense retrieval framework that integrates structured auxiliary context—generated via graph traversal—into the dense retrieval pipeline for the first time. It enhances semantic matching through LLM-driven spatial reasoning and comprises four components: code graph sampling, LLM-assisted contextualization, multilingual embedding fine-tuning, and re-ranking. Contribution/Results: Evaluated on multilingual codebases, our method significantly outperforms BM25 and standard dense retrieval baselines, achieving an average 18.7% improvement in Recall@10. Results empirically validate the effectiveness of jointly leveraging graph-structural guidance and LLM-based spatial reasoning for code semantic retrieval.

Technology Category

Application Category

📝 Abstract

Retrieving code units (e.g., files, classes, functions) that are semantically relevant to a given user query, bug report, or feature request from large codebases is a fundamental challenge for LLM-based coding agents. Agentic approaches typically employ sparse retrieval methods like BM25 or dense embedding strategies to identify relevant units. While embedding-based approaches can outperform BM25 by large margins, they often lack exploration of the codebase and underutilize its underlying graph structure. To address this, we propose SpIDER (Spatially Informed Dense Embedding Retrieval), an enhanced dense retrieval approach that incorporates LLM-based reasoning over auxiliary context obtained through graph-based exploration of the codebase. Empirical results show that SpIDER consistently improves dense retrieval performance across several programming languages.

Problem

Research questions and friction points this paper is trying to address.

Retrieving semantically relevant code units from large codebases

Improving dense embedding retrieval by exploring codebase structure

Enhancing software issue localization with graph-informed LLM reasoning

Innovation

Methods, ideas, or system contributions that make the work stand out.

Incorporates LLM-based reasoning for code retrieval

Uses graph-based exploration of codebase structure

Enhances dense retrieval with spatial information

🔎 Similar Papers

BLAZE: Cross-Language and Cross-Project Bug Localization via Dynamic Chunking and Hard Example Learning