Enhancing Repository-Level Code Generation with Integrated Contextual Information

📅 2024-06-05
🏛️ arXiv.org
📈 Citations: 10
Influential: 1
🤖 AI Summary
This paper addresses insufficient use of cross-file context in repository-level code generation, in particular the difficulty of balancing general knowledge with fine-grained type dependencies in statically typed languages, by proposing a type-dependency-driven context enhancement method. The approach integrates type dependency graphs derived from static analysis (for Java and Rust) with multi-file retrieval results to construct semantically richer, structured prompts, overcoming the locality limitations of conventional retrieval methods. The core contribution is a novel type-dependency-driven context integration mechanism that models structural knowledge across files and modules. Evaluated on 199 Java and 90 Rust tasks, the method improves pass@k over RepoCoder by up to 17.35%. It also generalizes across both code-specific and general-purpose large language models.
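To make the idea of a type dependency graph more concrete, the following is a minimal sketch of how types reachable from a target class might be collected across files. The graph contents, the function name `collect_type_context`, and the depth limit are all hypothetical and chosen for illustration; the summary does not describe the actual traversal, and in practice a static analyzer would produce the graph.

```python
from collections import deque

# Hypothetical type dependency graph: each type maps to the types referenced
# by its fields and method signatures. Hard-coded here for illustration only;
# a real pipeline would obtain it from a static analyzer.
TYPE_DEPS = {
    "OrderService": ["Order", "PaymentGateway"],
    "Order": ["Customer", "LineItem"],
    "PaymentGateway": ["Receipt"],
    "Customer": [],
    "LineItem": ["Product"],
    "Product": [],
    "Receipt": [],
}


def collect_type_context(root, deps, max_depth=2):
    """Return the types reachable from `root` within `max_depth` hops, in BFS order."""
    seen = {root}
    ordered = []
    queue = deque([(root, 0)])
    while queue:
        name, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not expand beyond the depth limit
        for dep in deps.get(name, []):
            if dep not in seen:
                seen.add(dep)
                ordered.append(dep)
                queue.append((dep, depth + 1))
    return ordered


print(collect_type_context("OrderService", TYPE_DEPS, max_depth=1))
print(collect_type_context("OrderService", TYPE_DEPS, max_depth=2))
```

A depth limit is one simple way to keep the collected context small enough for a prompt; nearer types are listed first, so truncating the list drops the most distant dependencies.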

📝 Abstract
Large language models (LLMs) have demonstrated remarkable capabilities in code generation tasks. However, repository-level code generation presents unique challenges, particularly due to the need to utilize information spread across multiple files within a repository. Existing retrieval-based approaches sometimes fall short as they are limited in obtaining a broader and deeper repository context. In this paper, we present CatCoder, a novel code generation framework designed for statically typed programming languages. CatCoder enhances repository-level code generation by integrating relevant code and type context. Specifically, it leverages static analyzers to extract type dependencies and merges this information with retrieved code to create comprehensive prompts for LLMs. To evaluate the effectiveness of CatCoder, we adapt and construct benchmarks that include 199 Java tasks and 90 Rust tasks. The results show that CatCoder outperforms the RepoCoder baseline by up to 17.35%, in terms of pass@k score. Furthermore, the generalizability of CatCoder is assessed using various LLMs, including both code-specialized models and general-purpose models. Our findings indicate consistent performance improvements across all models, which underlines the practicality of CatCoder.
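The abstract reports results as pass@k scores but does not spell out how they are computed. The sketch below assumes the commonly used unbiased estimator, 1 - C(n-c, k) / C(n, k), where n samples are generated per task and c of them pass the tests; whether the paper uses exactly this estimator is an assumption, and the function names are illustrative.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate for one task: 1 - C(n - c, k) / C(n, k).

    n: samples generated, c: samples that pass the tests, k: sample budget.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # math.comb returns 0 when k > n - c, so the estimate becomes 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results, k):
    """Average pass@k over tasks; `results` is a list of (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Hypothetical per-task outcomes, not data from the paper.
example = [(10, 3), (10, 0), (10, 10)]
print(mean_pass_at_k(example, k=1))
```

With k = 1 the estimate for a task reduces to c / n, the fraction of passing samples.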
Problem

Research questions and friction points this paper is trying to address.

Repository-level code generation struggles with multi-file context integration
Existing retrieval methods fail to capture comprehensive type dependencies
LLMs need better context integration for statically typed languages
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses static analyzers to extract type dependencies
Merges type context with retrieved code for prompts
Enhances repository-level code generation for typed languages
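The merging step listed above can be sketched as follows. This is only an illustration under stated assumptions: the section labels, the character budget, the rule of dropping the lowest-ranked retrieved snippets first, and the name `build_prompt` are hypothetical and are not taken from the paper.

```python
def build_prompt(type_context, retrieved, target, budget=1200):
    """Join type context, retrieved code, and the target into one prompt.

    `retrieved` is assumed to be sorted from most to least relevant. When the
    prompt exceeds `budget` characters, the least relevant snippets are dropped
    first; type context and the target are always kept.
    """
    kept = list(retrieved)
    while True:
        parts = ["// Type context"] + list(type_context)
        if kept:
            parts += ["// Retrieved code"] + kept
        parts += ["// Complete the following", target]
        prompt = "\n".join(parts)
        if len(prompt) <= budget or not kept:
            return prompt
        kept.pop()  # drop the lowest-ranked retrieved snippet and retry


# Hypothetical Java-flavoured inputs for illustration.
types = ["class Order { Customer customer; }"]
snippets = [
    "double subtotal(Order o) { return o.items().size(); }",
    "void log(String msg) { System.out.println(msg); }",
]
print(build_prompt(types, snippets, "double total(Order o) {"))
```

Placing the type context ahead of the retrieved code is a design choice made for this sketch; the ordering used in the actual prompts is not stated here.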
Zhiyuan Pan
Zhejiang University
Xing Hu
The State Key Laboratory of Blockchain and Data Security, Zhejiang University, China
Xin Xia
The State Key Laboratory of Blockchain and Data Security, Zhejiang University, China
Xiaohu Yang
National University of Defense Technology
Plasma physics, Laser-plasma interaction, Inertial confinement fusion, Charged particle beam