Strengthening Programming Comprehension in Large Language Models through Code Generation

📅 2025-08-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing large language models (LLMs) exhibit strong performance on code-related tasks but demonstrate shallow understanding of fundamental programming concepts—such as data flow and control flow—leading to insufficient robustness in complex code reasoning. To address this, we propose a counterfactual code augmentation framework that explicitly exposes the causal relationships underlying program structure by generating semantically coherent yet logically perturbed counterfactual code samples. Integrated with concept-aware annotation and instruction tuning, our approach establishes a concept-grounded fine-tuning mechanism. Crucially, it requires no additional human annotation and is plug-and-play compatible with mainstream LLMs. Evaluated across multiple code understanding benchmarks—including CodeXGLUE, Refactory, and Coda—our method delivers consistent improvements, significantly enhancing deep program logic comprehension, reasoning robustness, and decision interpretability. This work introduces a novel paradigm for strengthening the foundational programming capabilities of LLMs.
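The summary describes counterfactual samples as "semantically coherent yet logically perturbed" code. As a rough illustration of what such a perturbation might look like (the paper's actual generation procedure is not specified here, and names like `make_counterfactual` are hypothetical), one simple control-flow perturbation flips comparison operators so the surface form stays coherent while the program logic changes:

```python
# Illustrative sketch only, not the paper's pipeline: generate a
# counterfactual code pair by inverting comparison operators, which
# perturbs branch logic while keeping the code syntactically valid.
import ast

ORIGINAL = """\
def max_of(a, b):
    if a > b:
        return a
    return b
"""

class FlipComparisons(ast.NodeTransformer):
    """Invert comparison operators (a > b -> a < b) to perturb control flow."""
    _FLIP = {ast.Gt: ast.Lt, ast.Lt: ast.Gt, ast.GtE: ast.LtE, ast.LtE: ast.GtE}

    def visit_Compare(self, node):
        node.ops = [self._FLIP.get(type(op), type(op))() for op in node.ops]
        return node

def make_counterfactual(source: str) -> str:
    tree = ast.parse(source)
    tree = FlipComparisons().visit(tree)
    return ast.unparse(tree)

counterfactual = make_counterfactual(ORIGINAL)
print(counterfactual)
# The perturbed variant now returns the minimum instead of the maximum,
# so a concept-aware annotation could pair
# (ORIGINAL, counterfactual, "comparison flipped: max -> min").
```

Contrasting the original and perturbed versions is one plausible way to expose the causal link between a single operator and the program's overall behavior.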

📝 Abstract
Large language models (LLMs) have recently shown impressive results on diverse code-related tasks, benefiting from large-scale training and instruction tuning. However, studies reveal that their grasp of fundamental programming concepts, such as data flow and control flow, remains shallow, leading to fragile performance when code requires deeper reasoning. This limitation restricts the practical adoption of LLMs in real-world software development. To address this issue, this work introduces a counterfactual code augmentation framework combined with concept-aware tuning, designed to guide LLMs toward stronger conceptual understanding. Comprehensive evaluation across multiple models and benchmarks demonstrates the effectiveness of the proposed approach.
Problem

Research questions and friction points this paper is trying to address.

Enhancing LLMs' grasp of fundamental programming concepts
Addressing shallow understanding of data and control flow
Improving practical adoption of LLMs in software development
Innovation

Methods, ideas, or system contributions that make the work stand out.

Counterfactual code augmentation for deeper understanding
Concept-aware tuning to improve programming concepts
Enhanced LLM performance in code reasoning
Xiaoning Ren
University of Science and Technology of China
Qiang Hu
School of Cyber Security, Tianjin University
Wei Ma
Singapore Management University
Yan Li
University of Science and Technology of China
Yao Zhang
School of Cyber Security, Tianjin University
Lingxiao Jiang
Professor of Computer Science, Singapore Management University
Software Engineering, Data Mining, Cyber Security, Programming Languages, Systems
Yinxing Xue
Research Professor, Chinese Academy of Sciences
Software Engineering, Software Security, Program Analysis, Search Based Software Engineering