Retrieval-Augmented Code Generation: A Survey with Focus on Repository-Level Approaches

πŸ“… 2025-10-06
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
This paper addresses core challenges in repository-level code generation (RLCG)β€”namely, difficulty in modeling long-range dependencies, cross-file semantic inconsistency, and weak global structural coherence. To this end, it introduces the first systematic, unified analytical framework for RLCG grounded in retrieval-augmented generation (RAG). The framework features a multi-granularity (function-/file-/repository-level) classification scheme that elucidates key mechanisms for cross-file dependency modeling and global consistency maintenance, and it integrates context-aware generation, modular architecture design, and multi-source information retrieval to enhance both output quality and scalability. The paper further surveys prevailing datasets and evaluation benchmarks, identifying critical bottlenecks in dynamic context updating, semantic consistency assurance, and large-scale repository adaptation. Collectively, this work provides foundational theoretical insights and practical technical pathways for AI-driven software engineering.


πŸ“ Abstract
Recent advancements in large language models (LLMs) have substantially improved automated code generation. While function-level and file-level generation have achieved promising results, real-world software development typically requires reasoning across entire repositories. This gives rise to the challenging task of Repository-Level Code Generation (RLCG), where models must capture long-range dependencies, ensure global semantic consistency, and generate coherent code spanning multiple files or modules. To address these challenges, Retrieval-Augmented Generation (RAG) has emerged as a powerful paradigm that integrates external retrieval mechanisms with LLMs, enhancing context-awareness and scalability. In this survey, we provide a comprehensive review of research on Retrieval-Augmented Code Generation (RACG), with an emphasis on repository-level approaches. We categorize existing work along several dimensions, including generation strategies, retrieval modalities, model architectures, training paradigms, and evaluation protocols. Furthermore, we summarize widely used datasets and benchmarks, analyze current limitations, and outline key challenges and opportunities for future research. Our goal is to establish a unified analytical framework for understanding this rapidly evolving field and to inspire continued progress in AI-powered software engineering.
Problem

Research questions and friction points this paper is trying to address.

Addressing repository-level code generation challenges across multiple files
Enhancing context-awareness in code generation using retrieval-augmented approaches
Improving global semantic consistency in large-scale software development
Innovation

Methods, ideas, or system contributions that make the work stand out.

Retrieval-augmented generation enhances code generation
Repository-level approaches address multi-file dependencies
External retrieval mechanisms improve context-awareness and scalability
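The retrieve-then-generate loop these points describe can be sketched in a few lines. This is a minimal illustration, not the paper's method: all file names and functions below are hypothetical, and the token-overlap scorer is a stand-in for the dense or structural retrievers that real RACG systems use before prompting an LLM.

```python
import re

def tokenize(text):
    """Lowercase alphanumeric/underscore tokens, as a crude lexical signature."""
    return set(re.findall(r"[a-z0-9_]+", text.lower()))

def retrieve(query, repo_chunks, k=2):
    """Rank repository chunks by token overlap with the query (a stand-in
    for semantic retrieval) and return the top-k as cross-file context."""
    q = tokenize(query)
    scored = sorted(
        repo_chunks,
        key=lambda c: len(q & tokenize(c["code"])),
        reverse=True,
    )
    return scored[:k]

def build_prompt(query, context_chunks):
    """Assemble the retrieved cross-file context plus the task into one prompt."""
    context = "\n\n".join(f"# {c['path']}\n{c['code']}" for c in context_chunks)
    return f"{context}\n\n# Task: {query}\n"

# A toy "repository" of code chunks (paths and contents are illustrative).
repo_chunks = [
    {"path": "db/session.py", "code": "def get_session():\n    return Session(engine)"},
    {"path": "models/user.py", "code": "class User(Base):\n    name = Column(String)"},
    {"path": "utils/log.py",  "code": "def log(msg):\n    print(msg)"},
]

prompt = build_prompt(
    "add a function that queries all User rows via get_session",
    retrieve("query User rows session", repo_chunks),
)
# In a full pipeline, `prompt` would now be sent to the LLM; here we just inspect it.
print(prompt)
```

The two dependency-bearing files (the session helper and the model definition) are pulled into context while the irrelevant logger is left out, which is the essence of how retrieval lets the generator respect cross-file dependencies without reading the whole repository.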