AI Summary
Existing large language models (LLMs) struggle to capture structural code semantics because they operate on linearized token sequences; mainstream enhancement approaches are either constrained by prompt length or require architectural modifications, limiting compatibility with frozen, instruction-tuned LLMs. This paper introduces CGBridge, a plug-and-play external Bridge module that aligns code, graph, and textual semantics via cross-modal attention, injecting structure-aware prompts generated by a pretrained code graph encoder into a frozen LLM. CGBridge requires no parameter updates or architectural changes to the target LLM, relying solely on a self-supervised pretrained graph encoder and a lightweight Bridge module. Evaluated on code summarization and translation tasks, it achieves up to a 38.87% relative improvement in execution accuracy and runs more than four times faster than LoRA at inference. The method significantly enhances structural awareness while maintaining high deployment efficiency.
Abstract
Large Language Models (LLMs) have demonstrated remarkable performance in code intelligence tasks such as code generation, summarization, and translation. However, their reliance on linearized token sequences limits their ability to understand the structural semantics of programs. While prior studies have explored graph-augmented prompting and structure-aware pretraining, they either suffer from prompt length constraints or require task-specific architectural changes that are incompatible with large-scale instruction-following LLMs. To address these limitations, this paper proposes CGBridge, a novel plug-and-play method that enhances LLMs with Code Graph information through an external, trainable Bridge module. CGBridge first pretrains a code graph encoder via self-supervised learning on a large-scale dataset of 270K code graphs to learn structural code semantics. It then trains an external module to bridge the modality gap among code, graph, and text by aligning their semantics through cross-modal attention mechanisms. Finally, the Bridge module generates structure-informed prompts that are injected into a frozen LLM, and it is fine-tuned for downstream code intelligence tasks. Experiments show that CGBridge achieves notable improvements over both the original model and the graph-augmented prompting method. Specifically, it yields 16.19% and 9.12% relative gains in LLM-as-a-Judge score on code summarization, and 9.84% and 38.87% relative gains in Execution Accuracy on code translation. Moreover, CGBridge achieves over 4x faster inference than LoRA-tuned models, demonstrating both effectiveness and efficiency in structure-aware code understanding.
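The core mechanism described above can be illustrated with a minimal sketch: learnable query tokens cross-attend over graph-node embeddings (from the pretrained code graph encoder) to produce a fixed number of prompt vectors, which would then be prepended to the frozen LLM's input embeddings. This is a simplified single-head illustration with hypothetical shapes and randomly initialized weights, not the paper's actual implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def bridge_prompts(graph_nodes, query_tokens, Wq, Wk, Wv):
    """Cross-modal attention sketch: learnable query tokens attend over
    graph-node embeddings to yield structure-informed prompt vectors."""
    Q = query_tokens @ Wq                      # (m, d): queries from prompt tokens
    K = graph_nodes @ Wk                       # (n, d): keys from graph nodes
    V = graph_nodes @ Wv                       # (n, d): values from graph nodes
    scores = Q @ K.T / np.sqrt(Q.shape[-1])    # scaled dot-product attention
    return softmax(scores, axis=-1) @ V        # (m, d): prompt embeddings

rng = np.random.default_rng(0)
d = 16                                         # hypothetical embedding width
nodes = rng.standard_normal((10, d))           # 10 graph-node embeddings (assumed)
queries = rng.standard_normal((4, d))          # 4 learnable prompt tokens (assumed)
Wq, Wk, Wv = (0.1 * rng.standard_normal((d, d)) for _ in range(3))

prompts = bridge_prompts(nodes, queries, Wq, Wk, Wv)
print(prompts.shape)  # (4, 16) — these vectors would be prepended to the LLM input
```

Because only the query tokens and projection weights would be trained, the target LLM stays frozen, which is what makes the module plug-and-play.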