🤖 AI Summary
Existing code retrieval models for retrieval-augmented code generation (RACG) rely heavily on superficial textual features, such as docstrings and identifier names, which biases them toward syntactically well-documented but semantically irrelevant code and degrades retrieval accuracy. To address this, we propose SACL, a semantic-aware re-ranking and code localization framework. SACL identifies sources of textual bias via masked analysis, integrates deep semantic representations of code with structural knowledge and context-aware re-ranking, and introduces a fine-grained localization mechanism that substantially reduces dependence on surface-level features. Evaluated on HumanEval, MBPP, and SWE-Bench-Lite, SACL achieves absolute Recall@1 improvements of +12.8%, +9.4%, and +7.0%, respectively, and boosts HumanEval Pass@1 by +4.88%. These results demonstrate SACL's dual efficacy in enhancing both retrieval quality and downstream code generation performance.
📝 Abstract
Retrieval-Augmented Code Generation (RACG) is a critical technique for enhancing code generation by retrieving relevant information. In this work, we conduct an in-depth analysis of code retrieval by systematically masking specific features while preserving code functionality. Our discoveries include: (1) although trained on code, current retrievers heavily rely on surface-level textual features (e.g., docstrings, identifier names), and (2) they exhibit a strong bias towards well-documented code, even if the documentation is irrelevant. Based on our discoveries, we propose SACL, a framework that enriches textual information and reduces bias by augmenting code or structural knowledge with semantic information. Extensive experiments show that SACL substantially improves code retrieval (e.g., by 12.8% / 9.4% / 7.0% Recall@1 on HumanEval / MBPP / SWE-Bench-Lite), which also leads to better code generation performance (e.g., by 4.88% Pass@1 on HumanEval).
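The masked analysis described above can be illustrated with a small sketch. The snippet below is not the paper's implementation, only a minimal example of the general idea: strip docstrings and rename identifiers with Python's `ast` module so the transformed code keeps its behavior while losing the surface-level textual cues a retriever might latch onto. All names here (`IdentifierMasker`, `mask_code`, the `v0`, `v1`, … renaming scheme) are hypothetical.

```python
import ast


class IdentifierMasker(ast.NodeTransformer):
    """Rename locally defined identifiers to opaque names and drop docstrings.

    Only names the code itself defines (function names, arguments, assigned
    variables) are renamed, so builtins and imports keep working and the
    code's functionality is preserved.
    """

    def __init__(self):
        self.mapping = {}

    def _mask(self, name):
        if name not in self.mapping:
            self.mapping[name] = f"v{len(self.mapping)}"
        return self.mapping[name]

    def visit_FunctionDef(self, node):
        node.name = self._mask(node.name)
        for arg in node.args.args:
            arg.arg = self._mask(arg.arg)
        # Remove a leading docstring expression, if present.
        body = node.body
        if (body and isinstance(body[0], ast.Expr)
                and isinstance(body[0].value, ast.Constant)
                and isinstance(body[0].value.value, str)):
            body = body[1:] or [ast.Pass()]
        node.body = [self.visit(stmt) for stmt in body]
        return node

    def visit_Name(self, node):
        if isinstance(node.ctx, ast.Store):
            node.id = self._mask(node.id)       # new local definition
        elif node.id in self.mapping:
            node.id = self.mapping[node.id]     # reference to a masked name
        return node


def mask_code(source: str) -> str:
    """Return a functionally equivalent version of `source` with surface
    textual features (docstrings, identifier names) masked out."""
    tree = IdentifierMasker().visit(ast.parse(source))
    return ast.unparse(tree)  # requires Python 3.9+
```

Comparing a retriever's rankings on the original versus the masked corpus then reveals how much of its score came from surface text rather than code semantics.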