Deep Graph-Language Fusion for Structure-Aware Code Generation

📅 2026-05-05

🤖 AI Summary

This work addresses a limitation of current large language models: because they process input sequentially, they struggle to capture the structured nature of code, such as control flow and data dependencies. To overcome this, the authors propose CGFuse, a framework that encodes fine-grained code graphs (e.g., abstract syntax trees and data flow graphs) with a graph neural network and integrates the resulting features into the intermediate layers of pretrained language models at the token level, which the authors present as the first systematic exploration of this kind of fusion. This approach enables structure-aware code generation while avoiding the information loss inherent in conventional prompt-based encoding or feature compression techniques. Experiments show that CGFuse consistently improves code generation across multiple state-of-the-art large language models, with BLEU improvements of 10–16% and CodeBLEU gains of 6–11%.

📝 Abstract

Pre-trained Language Models (PLMs) have the potential to transform software development tasks. However, despite significant advances, current PLMs struggle to capture the structured and relational attributes of code, such as control flow and data dependencies. This limitation is rooted in an architectural mismatch: whereas code structure is best represented by graphs, transformer-based LLMs process input as sequential token patterns and therefore lack explicit structural awareness. Recent research has explored integrating graph-based code representations using techniques like graph feature extraction, retrieval-augmented generation, and prompt engineering, but these approaches suffer from information loss during dense feature extraction or prompt encoding. Notably, the potential of deep, token-level fusion of graph features within model internals has not been systematically explored. In this paper, we initiate such an exploration by introducing CGFuse, a novel framework that enables token-level integration of graph-derived representations by infusing learned graph features directly into the intermediate layers of pre-trained language models. CGFuse combines a graph neural network (GNN) with a language model to explicitly preserve and exploit fine-grained structural information from code graphs, including abstract syntax trees and data-flow graphs. We systematically evaluate CGFuse across multiple LLMs, demonstrating improvements of 10-16% in BLEU and 6-11% in CodeBLEU for code generation. These results highlight the potential of deep graph-PLM integration to advance the field toward more robust, capable AI-driven software development.
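To make the idea of token-level fusion concrete, the following is a minimal sketch in plain Python, not the authors' implementation. The abstract does not specify the GNN architecture, the fusion operator, or how tokens are aligned to graph nodes, so everything here is an assumption made for illustration: `message_pass` uses simple mean aggregation as a stand-in for a GNN layer, `fuse` uses a gated addition with a fixed hypothetical `gate` value, and `token_to_node` is a hypothetical alignment map from token positions to graph node indices.

```python
def message_pass(node_feats, edges):
    """One round of mean aggregation over each node and its neighbors.

    A stand-in for a GNN layer (assumption: the paper's GNN is not
    described in the abstract). Edges are treated as undirected.
    """
    n = len(node_feats)
    dim = len(node_feats[0])
    neighbors = [[] for _ in range(n)]
    for src, dst in edges:
        neighbors[dst].append(src)
        neighbors[src].append(dst)
    updated = []
    for i in range(n):
        group = [node_feats[i]] + [node_feats[j] for j in neighbors[i]]
        updated.append([sum(v[d] for v in group) / len(group) for d in range(dim)])
    return updated


def fuse(hidden, node_feats, token_to_node, gate=0.5):
    """Add gated graph features to the hidden state of each aligned token.

    hidden: per-token hidden states from an intermediate LM layer.
    token_to_node: hypothetical map from token position to node index;
    tokens without an aligned node pass through unchanged.
    """
    fused = []
    for position, state in enumerate(hidden):
        node = token_to_node.get(position)
        if node is None:
            fused.append(list(state))
        else:
            graph_vec = node_feats[node]
            fused.append([h + gate * g for h, g in zip(state, graph_vec)])
    return fused


# Toy example: two graph nodes joined by one edge, three tokens,
# with tokens 0 and 2 aligned to nodes 0 and 1.
nodes = message_pass([[1.0, 0.0], [0.0, 1.0]], edges=[(0, 1)])
hidden = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
fused = fuse(hidden, nodes, token_to_node={0: 0, 2: 1})
print(fused)
```

In this sketch each token keeps its own hidden state and receives the feature vector of its own graph node, rather than a single pooled summary of the whole graph. That per-token granularity is the property the abstract contrasts with dense feature extraction and prompt encoding. In a real system the gate would presumably be learned and the graph features projected to the model's hidden size, which is omitted here.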

Problem

Research questions and friction points this paper is trying to address.

code generation

graph-language fusion

structural awareness

pre-trained language models

code structure

Innovation

Methods, ideas, or system contributions that make the work stand out.

graph-language fusion

token-level integration

code structure awareness