🤖 AI Summary
Existing code retrieval models for retrieval-augmented code generation (RACG) rely heavily on superficial textual features, such as docstrings and identifier names, which biases them toward syntactically well-documented but semantically irrelevant code and degrades retrieval accuracy. To address this, we propose SACL, a semantic-aware re-ranking and code localization framework. SACL identifies sources of textual bias via masked analysis, integrates deep semantic representations of code with structural knowledge and context-aware re-ranking, and introduces a fine-grained localization mechanism that substantially reduces dependence on surface-level features. Evaluated on HumanEval, MBPP, and SWE-Bench-Lite, SACL achieves absolute Recall@1 improvements of +12.8%, +9.4%, and +7.0%, respectively, and boosts HumanEval Pass@1 by +4.88%. These results demonstrate SACL's dual efficacy in enhancing both retrieval quality and downstream code generation performance.
📝 Abstract
Retrieval-Augmented Code Generation (RACG) is a critical technique for enhancing code generation by retrieving relevant information. In this work, we conduct an in-depth analysis of code retrieval by systematically masking specific features while preserving code functionality. Our discoveries include: (1) although trained on code, current retrievers heavily rely on surface-level textual features (e.g., docstrings, identifier names), and (2) they exhibit a strong bias towards well-documented code, even if the documentation is irrelevant. Based on our discoveries, we propose SACL, a framework that enriches textual information and reduces bias by augmenting code or structural knowledge with semantic information. Extensive experiments show that SACL substantially improves code retrieval (e.g., by 12.8% / 9.4% / 7.0% Recall@1 on HumanEval / MBPP / SWE-Bench-Lite), which also leads to better code generation performance (e.g., by 4.88% Pass@1 on HumanEval).
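The masked analysis described above can be illustrated with a small sketch. The snippet below is not the paper's implementation, only a minimal example of the general idea: strip docstrings and rename identifiers with Python's `ast` module so the transformed code keeps its behavior while losing the surface-level textual cues a retriever might latch onto. All names here (`IdentifierMasker`, `mask_code`, the `v0`, `v1`, … renaming scheme) are hypothetical.

```python
import ast


class IdentifierMasker(ast.NodeTransformer):
    """Rename locally defined identifiers to opaque names and drop docstrings.

    Only names the code itself defines (function names, arguments, assigned
    variables) are renamed, so builtins and imports keep working and the
    code's functionality is preserved.
    """

    def __init__(self):
        self.mapping = {}

    def _mask(self, name):
        if name not in self.mapping:
            self.mapping[name] = f"v{len(self.mapping)}"
        return self.mapping[name]

    def visit_FunctionDef(self, node):
        node.name = self._mask(node.name)
        for arg in node.args.args:
            arg.arg = self._mask(arg.arg)
        # Remove a leading docstring expression, if present.
        body = node.body
        if (body and isinstance(body[0], ast.Expr)
                and isinstance(body[0].value, ast.Constant)
                and isinstance(body[0].value.value, str)):
            body = body[1:] or [ast.Pass()]
        node.body = [self.visit(stmt) for stmt in body]
        return node

    def visit_Name(self, node):
        if isinstance(node.ctx, ast.Store):
            node.id = self._mask(node.id)       # new local definition
        elif node.id in self.mapping:
            node.id = self.mapping[node.id]     # reference to a masked name
        return node


def mask_code(source: str) -> str:
    """Return a functionally equivalent version of `source` with surface
    textual features (docstrings, identifier names) masked out."""
    tree = IdentifierMasker().visit(ast.parse(source))
    return ast.unparse(tree)  # requires Python 3.9+
```

Comparing a retriever's rankings on the original versus the masked corpus then reveals how much of its score came from surface text rather than code semantics.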