AI Summary
This work addresses the limitations of current large code models in repository-level code completion: they struggle to effectively leverage repository-specific context and domain knowledge. Traditional retrieval-augmented approaches often suffer from semantic mismatches between queries and target code and neglect reasoning information, leading to suboptimal performance. To overcome these challenges, we propose AlignCoder, a novel framework that constructs enhanced queries by generating multiple candidate completions to bridge the semantic gap. Furthermore, we introduce AlignRetriever, a reinforcement learning-driven retriever that exploits reasoning cues embedded in the candidates to enable more accurate cross-file retrieval. Evaluated on CrossCodeEval and RepoEval, our method significantly outperforms existing baselines, achieving an 18.1% absolute improvement in exact match (EM) scores and demonstrating strong generalization across diverse code large language models and programming languages.
Abstract
Repository-level code completion remains a challenging task for existing code large language models (code LLMs) due to their limited understanding of repository-specific context and domain knowledge. While retrieval-augmented generation (RAG) approaches have shown promise by retrieving relevant code snippets as cross-file context, they suffer from two fundamental problems: misalignment between the query and the target code during retrieval, and the inability of existing retrieval methods to effectively utilize inference information. To address these challenges, we propose AlignCoder, a repository-level code completion framework that introduces a query enhancement mechanism and a reinforcement learning-based retriever training method. Our approach generates multiple candidate completions to construct an enhanced query that bridges the semantic gap between the initial query and the target code. Additionally, we employ reinforcement learning to train AlignRetriever, which learns to leverage the inference information in the enhanced query for more accurate retrieval. We evaluate AlignCoder on two widely used benchmarks (CrossCodeEval and RepoEval) across five backbone code LLMs, demonstrating an 18.1% improvement in EM score over baselines on CrossCodeEval. The results show that our framework achieves superior performance and generalizes well across code LLMs and programming languages.
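The pipeline described above (sample candidate completions, fold them into an enhanced query, then retrieve cross-file context with that query) can be sketched in miniature as follows. Everything here is a hedged illustration, not the paper's implementation: `generate_candidates` stands in for sampling from a code LLM, and the token-overlap scorer stands in for the RL-trained AlignRetriever.

```python
def generate_candidates(prompt, n=3):
    # Stand-in for sampling n candidate completions from a code LLM.
    # Real candidates would be model-generated code continuations.
    return [f"{prompt} <candidate_{i}>" for i in range(n)]

def build_enhanced_query(prompt, candidates):
    # Enhanced query = in-file context plus candidate completions,
    # narrowing the semantic gap between the query and the target code.
    return prompt + "\n" + "\n".join(candidates)

def retrieve(query, corpus, k=2):
    # Toy lexical retriever: rank repository snippets by token overlap
    # with the enhanced query. AlignCoder instead trains a retriever
    # with reinforcement learning to exploit the inference information.
    q_tokens = set(query.split())
    scored = sorted(
        corpus,
        key=lambda snippet: len(q_tokens & set(snippet.split())),
        reverse=True,
    )
    return scored[:k]

if __name__ == "__main__":
    repo_snippets = ["def foo(): pass", "class Bar: pass", "import os"]
    prompt = "def foo"
    candidates = generate_candidates(prompt)
    enhanced = build_enhanced_query(prompt, candidates)
    print(retrieve(enhanced, repo_snippets, k=1))
```

The retrieved snippets would then be prepended to the prompt as cross-file context for the final completion pass; that second generation step is omitted here for brevity.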