When Retrieval Hurts Code Completion: A Diagnostic Study of Stale Repository Context

📅 2026-05-14

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study examines the risk that retrieval-augmented code completion generates outdated, incompatible code when it relies on stale repository context. It treats temporal validity as an independent diagnostic dimension for Code RAG and builds a diagnostic dataset of 17 real-world Python function signature changes drawn from five repositories. In controlled experiments on Qwen2.5-Coder-7B-Instruct and gpt-4.1-mini, the authors evaluate completions under four retrieval conditions: current-only, stale-only, no retrieval, and mixed current/stale. Stale-only retrieval raises stale reference rates by 88.2 and 76.5 percentage points, respectively, over current-only retrieval. Generation without retrieval produces no stale references but yields only 1/17 passing completions. Mixed retrieval, which adds valid current evidence alongside the stale snippets, largely rescues the stale-only failures. Together these results indicate that stale context actively biases models toward obsolete repository state, making temporal validity a distinct variable in retrieval-augmented code generation.
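To make "stale reference" concrete, here is a minimal sketch of how a completion could be checked for a call that still uses an obsolete keyword argument. It is an illustration under stated assumptions, not the authors' detection procedure: the helper `load_rows`, its renamed keyword (`sep` to `delimiter`), and the function `calls_with_keyword` are all hypothetical.

```python
import ast


def calls_with_keyword(source: str, func_name: str, keyword: str) -> bool:
    """Return True if `source` calls `func_name` with the given keyword argument."""
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, ast.Call):
            callee = node.func
            # Handle both plain calls (f(...)) and attribute calls (mod.f(...)).
            if isinstance(callee, ast.Name):
                name = callee.id
            else:
                name = getattr(callee, "attr", None)
            if name == func_name and any(kw.arg == keyword for kw in node.keywords):
                return True
    return False


# Hypothetical signature change (assumption, not from the paper):
#   old: load_rows(path, sep=",")    new: load_rows(path, delimiter=",")
stale_completion = "rows = load_rows('data.csv', sep=',')"
current_completion = "rows = load_rows('data.csv', delimiter=',')"

is_stale_a = calls_with_keyword(stale_completion, "load_rows", "sep")
is_stale_b = calls_with_keyword(current_completion, "load_rows", "sep")
```

Parsing with `ast` rather than matching strings avoids false positives from comments or string literals that merely mention the old keyword. A real check would also need to cover positional-argument and return-type changes.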

📝 Abstract

Context: Retrieval-augmented code generation relies on cross-file repository context, but retrieved snippets may come from obsolete project states. Objectives: We study whether temporally stale repository snippets act as harmless noise or actively induce current-state-incompatible code. Methods: We conduct a controlled diagnostic study on a curated 17-sample set of production-helper signature changes from five Python repositories. For each sample, we compare current-only, stale-only, no-retrieval, and mixed current/stale retrieval conditions under prompts that hide commit freshness and expected current signatures. Results: Under neutralized prompts, stale-only retrieval induces stale helper references on 15/17 Qwen2.5-Coder-7B-Instruct samples and 13/17 gpt-4.1-mini samples, corresponding to 88.2 and 76.5 percentage-point increases over current-only retrieval. No retrieval produces zero stale references but only 1/17 passing completions. The two models share 75.0% Jaccard overlap among stale-triggering samples, and mixed conditions show that adding valid current evidence largely rescues stale-only failures. Conclusion: Temporal validity of retrieved repository context is a distinct diagnostic variable for Code RAG robustness: stale context can actively bias models toward obsolete repository state rather than merely removing useful evidence.
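The reported figures can be reproduced with simple arithmetic, sketched below. The counts (15/17, 13/17, 1/17) come from the abstract. The sample IDs are placeholders, and the overlap of 12 shared samples is an inference: it is the only intersection size that gives a 75.0% Jaccard index for sets of size 15 and 13. Because the percentage-point increases equal the stale-only rates, the sketch also assumes zero stale references under current-only retrieval.

```python
N_SAMPLES = 17  # size of the diagnostic set, from the abstract


def rate_pct(count: int, total: int = N_SAMPLES) -> float:
    """Percentage of samples, rounded to one decimal place."""
    return round(100 * count / total, 1)


def jaccard(a: set, b: set) -> float:
    """Jaccard index |A ∩ B| / |A ∪ B|."""
    return len(a & b) / len(a | b)


# Stale-only rates per model; the current-only baseline is assumed to be 0.
qwen_stale = rate_pct(15)      # Qwen2.5-Coder-7B-Instruct
gpt_stale = rate_pct(13)       # gpt-4.1-mini
no_retrieval_pass = rate_pct(1)

# Placeholder sample IDs (assumption): 12 shared stale-triggering samples.
qwen_ids = set(range(15))            # IDs 0..14
gpt_ids = set(range(12)) | {15}      # IDs 0..11 plus one model-specific sample
overlap = jaccard(qwen_ids, gpt_ids)

print(qwen_stale, gpt_stale, no_retrieval_pass, overlap)
```

With these inputs, the two stale-triggering sets share 12 samples out of a union of 16. Both models therefore fail on largely the same signature changes.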

Problem

Research questions and friction points this paper is trying to address.

retrieval-augmented code generation

stale context

code completion

temporal validity

repository state

Innovation

Methods, ideas, or system contributions that make the work stand out.

temporal staleness

retrieval-augmented code generation

code completion