🤖 AI Summary
Current large language models (LLMs) face significant limitations in complex programming tasks due to low-quality training data, static architectures, and weak reasoning capabilities, which hinder their practical utility in software engineering. This work proposes a systematic approach to these challenges: enhancing code data quality through CODA and CodeDenoise, designing syntax-guided LEAM-family models to improve architectural expressiveness, and boosting reasoning via muFiX prompting and the Specine agent framework. By integrating adversarial data augmentation, code denoising, syntax-aware modeling, advanced prompt engineering, and agent-driven reasoning, this methodology substantially improves model performance on complex programming tasks and facilitates the effective deployment of LLMs in real-world software development.
📝 Abstract
Recent advances in language models (LMs) have driven significant progress across software engineering tasks. However, existing LMs still struggle in complex programming scenarios due to limitations in data quality, model architecture, and reasoning capability. This research systematically addresses these challenges through three complementary directions: (1) improving code data quality with a code difference-guided adversarial augmentation technique (CODA) and a code denoising technique (CodeDenoise); (2) enhancing model architecture via syntax-guided code LMs (LEAM and LEAM++); and (3) advancing model reasoning with a prompting technique (muFiX) and an agent-based technique (Specine). Together, these techniques aim to promote the practical adoption of LMs in software development and further advance intelligent software engineering.