CodeGRAG: Bridging the Gap between Natural Language and Programming Language via Graphical Retrieval Augmented Generation

📅 2024-05-03
📈 Citations: 2
Influential: 0
🤖 AI Summary
Bridging the semantic and syntactic gap between natural language and programming languages remains a key challenge for enhancing large language models’ (LLMs) code generation capability. Method: We propose a graphical retrieval-augmented generation (RAG) framework: (1) representing code blocks as structured graphs via control-flow graphs (CFGs) and data-flow graphs (DFGs); (2) designing a meta-graph-driven hard prompt template and a GNN-based soft prompt injection mechanism; and (3) introducing dual syntactic–semantic constraints to optimize graph representations and enable cross-lingual alignment. Contribution/Results: Our approach requires no instruction fine-tuning and achieves, for the first time, cross-language generalizable code generation gains. It significantly outperforms state-of-the-art RAG and fine-tuning baselines across multiple benchmarks—including HumanEval, MBPP, and MultiPL-E—demonstrating robustness across Python, Java, C++, and JavaScript. The framework advances LLM-based code synthesis by unifying structural program analysis with neural prompting in a language-agnostic manner.
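Step (1) above represents code as control-flow and data-flow graphs. As a toy illustration of the data-flow side, the sketch below collects coarse def-to-use edges from Python source with the standard `ast` module; this is only a minimal approximation for intuition — the paper's pipeline uses proper compiler-grade flow analysis, and the function name here is invented for the example.

```python
import ast

def data_flow_edges(source: str):
    """Collect coarse def->use data-flow edges (name, def_line, use_line)
    from Python source. A toy stand-in for the DFG construction step."""
    tree = ast.parse(source)
    defs = {}   # variable name -> line of its most recent assignment seen
    edges = []  # (name, def_line, use_line)
    for node in ast.walk(tree):
        if isinstance(node, ast.Name):
            if isinstance(node.ctx, ast.Store):
                defs[node.id] = node.lineno
            elif isinstance(node.ctx, ast.Load) and node.id in defs:
                edges.append((node.id, defs[node.id], node.lineno))
    return edges
```

For straight-line code such as `x = 1; y = x + 2; z = y * x`, this yields edges linking each use of `x` and `y` back to its defining line — the kind of structural signal the retrieved graphs encode.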

📝 Abstract
Utilizing large language models to generate code has shown promise in revolutionizing software development. Despite the intelligence of large language models, their effectiveness in code generation can still be improved due to the syntactic gap and mismatched vocabulary between natural language and programming languages. In this paper, we propose CodeGRAG, a Graphical Retrieval Augmented Code Generation framework that bridges the gap between NL and PL to enhance the performance of LLMs. CodeGRAG builds a graphical view of code blocks based on their control flow and data flow to better capture programming domain knowledge, which helps natural-language-based LLMs understand code syntax and serves as a bridge between different programming languages. To inject the extracted structural knowledge into foundation models, we propose 1) a hard meta-graph prompt template that transforms the challenging syntax graph into an informative graphical view for tuning-free models and 2) a soft prompting technique that injects programming-language domain knowledge into model parameters by fine-tuning the models with soft signals encoded by a GNN expert model. Specifically, two constraints are designed to improve alignment and structural expressiveness, contributing to the informativeness of the single-token-sized external signal for enhanced code generation. CodeGRAG significantly improves the code generation ability of LLMs and can even offer performance gains for cross-lingual code generation. Implementation is available at https://anonymous.4open.science/r/Code-5970/ .
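The "hard meta-graph prompt template" serializes a retrieved code graph into plain text that a tuning-free LLM can read. The sketch below shows one plausible serialization; the exact template wording and field names are assumptions for illustration, not the paper's actual template.

```python
def meta_graph_prompt(task: str, nodes, edges):
    """Serialize a retrieved code graph into a textual 'hard prompt' block.
    nodes: list of node labels; edges: list of (src, dst, kind) triples.
    The template wording here is hypothetical."""
    node_lines = "\n".join(f"- {n}" for n in nodes)
    edge_lines = "\n".join(f"- {s} -> {t} ({kind})" for s, t, kind in edges)
    return (
        f"Task: {task}\n"
        "Retrieved code graph:\n"
        f"Nodes:\n{node_lines}\n"
        f"Edges:\n{edge_lines}\n"
        "Use the structure above as a hint when writing the code."
    )
```

The resulting string is simply prepended to the generation prompt, so no model weights change — which is why this path works for closed, tuning-free models.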
Problem

Research questions and friction points this paper is trying to address.

Bridging syntactic gap between natural and programming languages
Enhancing LLM code generation via graphical retrieval
Improving cross-lingual code generation performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Graphical Retrieval Augmented Code Generation framework
Hard meta-graph prompt template for syntax graphs
Soft prompting technique with GNN expert model
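The soft-prompt path encodes the graph with a GNN and injects the pooled embedding as a single-token signal. The following is a minimal dependency-free sketch of that idea — repeated mean aggregation over neighbors, then mean-pooling node states into one vector. It stands in for the GNN expert conceptually; the actual model is a trained network optimized with the paper's dual syntactic–semantic constraints.

```python
def gnn_soft_token(features, edges, rounds=2):
    """Toy message passing: `rounds` iterations of mean aggregation over
    an undirected graph, then mean-pool node states into a single vector
    (the 'single-token-sized' soft signal).
    features: {node: [float, ...]}; edges: [(src, dst), ...]."""
    neigh = {n: [] for n in features}
    for s, t in edges:
        neigh[s].append(t)
        neigh[t].append(s)
    state = {n: list(v) for n, v in features.items()}
    dim = len(next(iter(features.values())))
    for _ in range(rounds):
        new = {}
        for n in features:
            msgs = [state[m] for m in neigh[n]] + [state[n]]  # include self
            new[n] = [sum(v[i] for v in msgs) / len(msgs) for i in range(dim)]
        state = new
    # mean-pool all node states into one embedding vector
    return [sum(state[n][i] for n in state) / len(state) for i in range(dim)]
```

In the framework, a vector like this is projected into the LLM's embedding space and prepended as one extra token, so structural knowledge reaches the model without any textual serialization.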
Authors

Kounianhua Du — Shanghai Jiao Tong University
Jizheng Chen
Renting Rui — Shanghai Jiao Tong University
Huacan Chai — Shanghai Jiao Tong University
Lingyue Fu — Shanghai Jiao Tong University
Wei Xia — Huawei Noah's Ark Lab
Yasheng Wang — Tencent
Ruiming Tang — Huawei Noah's Ark Lab
Yong Yu
Weinan Zhang — Shanghai Jiao Tong University