CodeGRAG: Bridging the Gap between Natural Language and Programming Language via Graphical Retrieval Augmented Generation

📅 2024-05-03
📈 Citations: 2
Influential: 0
🤖 AI Summary
Bridging the semantic and syntactic gap between natural language and programming languages remains a key challenge for enhancing large language models’ (LLMs) code generation capability. Method: We propose a graphical retrieval-augmented generation (RAG) framework: (1) representing code blocks as structured graphs via control-flow graphs (CFGs) and data-flow graphs (DFGs); (2) designing a meta-graph-driven hard prompt template and a GNN-based soft prompt injection mechanism; and (3) introducing dual syntactic–semantic constraints to optimize graph representations and enable cross-lingual alignment. Contribution/Results: Our approach requires no instruction fine-tuning and achieves, for the first time, cross-language generalizable code generation gains. It significantly outperforms state-of-the-art RAG and fine-tuning baselines across multiple benchmarks—including HumanEval, MBPP, and MultiPL-E—demonstrating robustness across Python, Java, C++, and JavaScript. The framework advances LLM-based code synthesis by unifying structural program analysis with neural prompting in a language-agnostic manner.
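Step (1) above represents code as control-flow and data-flow graphs. As a toy illustration of the data-flow side, the sketch below collects coarse def-to-use edges from Python source with the standard `ast` module; this is only a minimal approximation for intuition — the paper's pipeline uses proper compiler-grade flow analysis, and the function name here is invented for the example.

```python
import ast

def data_flow_edges(source: str):
    """Collect coarse def->use data-flow edges (name, def_line, use_line)
    from Python source. A toy stand-in for the DFG construction step."""
    tree = ast.parse(source)
    defs = {}   # variable name -> line of its most recent assignment seen
    edges = []  # (name, def_line, use_line)
    for node in ast.walk(tree):
        if isinstance(node, ast.Name):
            if isinstance(node.ctx, ast.Store):
                defs[node.id] = node.lineno
            elif isinstance(node.ctx, ast.Load) and node.id in defs:
                edges.append((node.id, defs[node.id], node.lineno))
    return edges
```

For straight-line code such as `x = 1; y = x + 2; z = y * x`, this yields edges linking each use of `x` and `y` back to its defining line — the kind of structural signal the retrieved graphs encode.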

📝 Abstract
Utilizing large language models to generate code has shown promise in revolutionizing software development. Despite the intelligence of large language models, their effectiveness in code generation can still be improved due to the syntactic gap and mismatched vocabulary between natural language and programming languages. In this paper, we propose CodeGRAG, a Graphical Retrieval Augmented Code Generation framework that bridges the gap between NL and PL to enhance the performance of LLMs. CodeGRAG builds a graphical view of code blocks based on their control flow and data flow to better capture programming domain knowledge, which helps natural-language-based LLMs understand code syntax and serves as a bridge between different programming languages. To inject the extracted structural knowledge into foundation models, we propose 1) a hard meta-graph prompt template that transforms the challenging syntax graph into an informative graphical view for tuning-free models and 2) a soft prompting technique that injects programming-language domain knowledge into model parameters by fine-tuning the models with soft signals encoded by a GNN expert model. Specifically, two constraints are designed to improve alignment and structural expressiveness, contributing to the informativeness of the single-token-sized external signal for enhanced code generation. CodeGRAG significantly improves the code generation ability of LLMs and can even offer performance gains for cross-lingual code generation. Implementation is available at https://anonymous.4open.science/r/Code-5970/ .
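The "hard meta-graph prompt template" serializes a retrieved code graph into plain text that a tuning-free LLM can read. The sketch below shows one plausible serialization; the exact template wording and field names are assumptions for illustration, not the paper's actual template.

```python
def meta_graph_prompt(task: str, nodes, edges):
    """Serialize a retrieved code graph into a textual 'hard prompt' block.
    nodes: list of node labels; edges: list of (src, dst, kind) triples.
    The template wording here is hypothetical."""
    node_lines = "\n".join(f"- {n}" for n in nodes)
    edge_lines = "\n".join(f"- {s} -> {t} ({kind})" for s, t, kind in edges)
    return (
        f"Task: {task}\n"
        "Retrieved code graph:\n"
        f"Nodes:\n{node_lines}\n"
        f"Edges:\n{edge_lines}\n"
        "Use the structure above as a hint when writing the code."
    )
```

The resulting string is simply prepended to the generation prompt, so no model weights change — which is why this path works for closed, tuning-free models.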
Problem

Research questions and friction points this paper is trying to address.

Bridging syntactic gap between natural and programming languages
Enhancing LLM code generation via graphical retrieval
Improving cross-lingual code generation performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Graphical Retrieval Augmented Code Generation framework
Hard meta-graph prompt template for syntax graphs
Soft prompting technique with GNN expert model
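The soft-prompt path encodes the graph with a GNN and injects the pooled embedding as a single-token signal. The following is a minimal dependency-free sketch of that idea — repeated mean aggregation over neighbors, then mean-pooling node states into one vector. It stands in for the GNN expert conceptually; the actual model is a trained network optimized with the paper's dual syntactic–semantic constraints.

```python
def gnn_soft_token(features, edges, rounds=2):
    """Toy message passing: `rounds` iterations of mean aggregation over
    an undirected graph, then mean-pool node states into a single vector
    (the 'single-token-sized' soft signal).
    features: {node: [float, ...]}; edges: [(src, dst), ...]."""
    neigh = {n: [] for n in features}
    for s, t in edges:
        neigh[s].append(t)
        neigh[t].append(s)
    state = {n: list(v) for n, v in features.items()}
    dim = len(next(iter(features.values())))
    for _ in range(rounds):
        new = {}
        for n in features:
            msgs = [state[m] for m in neigh[n]] + [state[n]]  # include self
            new[n] = [sum(v[i] for v in msgs) / len(msgs) for i in range(dim)]
        state = new
    # mean-pool all node states into one embedding vector
    return [sum(state[n][i] for n in state) / len(state) for i in range(dim)]
```

In the framework, a vector like this is projected into the LLM's embedding space and prepended as one extra token, so structural knowledge reaches the model without any textual serialization.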
Authors

Kounianhua Du — Shanghai Jiao Tong University
Jizheng Chen
Renting Rui — Shanghai Jiao Tong University
Huacan Chai — Shanghai Jiao Tong University
Lingyue Fu — Shanghai Jiao Tong University
Wei Xia — Huawei Noah's Ark Lab
Yasheng Wang — Tencent
Ruiming Tang — Huawei Noah's Ark Lab
Yong Yu
Weinan Zhang — Shanghai Jiao Tong University