🤖 AI Summary
To address the dual bottlenecks of data scarcity and insufficient semantic correctness in automated C++ compilation error repair, this paper introduces the first large-scale, high-fidelity repair dataset and proposes a reinforcement learning (RL) framework guided by hybrid reward signals. We design a two-stage LLM-as-a-Judge evaluation mechanism that jointly verifies syntactic validity and semantic correctness to ensure patch quality. Furthermore, we integrate large language models (LLMs), RL, and human alignment via a generate-and-verify pipeline, closing the loop between dataset construction and model training. With RL fine-tuning, Qwen2.5-1.5B-Instruct achieves repair performance comparable to that of 14B-class models, substantially improving the practical utility and scalability of small models in real-world development scenarios.
📝 Abstract
The automated repair of C++ compilation errors presents a significant challenge, the resolution of which is critical for developer productivity. Progress in this domain is constrained by two primary factors: the scarcity of large-scale, high-fidelity datasets and the limitations of conventional supervised methods, which often fail to generate semantically correct patches. This paper addresses these gaps by introducing a comprehensive framework with three core contributions. First, we present CCrepair, a novel, large-scale C++ compilation error dataset constructed through a sophisticated generate-and-verify pipeline. Second, we propose a Reinforcement Learning (RL) paradigm guided by a hybrid reward signal, shifting the focus from mere compilability to the semantic quality of the fix. Finally, we establish a robust, two-stage evaluation system that provides this signal, centered on an LLM-as-a-Judge whose reliability has been rigorously validated against the collective judgments of a panel of human experts. This integrated approach aligns the training objective with generating high-quality, non-trivial patches that are both syntactically and semantically correct. We demonstrate the effectiveness of our approach experimentally: our RL-trained Qwen2.5-1.5B-Instruct model achieved performance comparable to a Qwen2.5-14B-Instruct model, validating the efficiency of our training paradigm. Our work provides the research community with a valuable new dataset and a more effective paradigm for training and evaluating robust compilation repair models, paving the way for more practical and reliable automated programming assistants.
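The two-stage hybrid reward described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names, the stubbed compile check, and the stubbed judge call are all hypothetical stand-ins (a real system would invoke an actual C++ compiler and an LLM-as-a-Judge).

```python
# Hedged sketch of a two-stage hybrid reward for compilation repair.
# All names here are illustrative assumptions, not the paper's API.

def compiles(patch: str) -> bool:
    """Stage 1: syntactic gate. In practice this would run a real C++
    compiler (e.g. `g++ -fsyntax-only`) on the patched source; here a
    trivial string check keeps the sketch self-contained."""
    return "error" not in patch

def judge_semantics(patch: str) -> float:
    """Stage 2: semantic quality score in [0, 1]. In the paper this
    role is played by a human-validated LLM-as-a-Judge; a stub stands
    in for that model call here."""
    return 1.0 if "fixed" in patch else 0.2

def hybrid_reward(patch: str) -> float:
    """Non-compiling patches earn zero reward, so the policy cannot
    collect semantic credit for invalid code; compiling patches are
    rewarded by their judged semantic quality rather than by mere
    compilability."""
    if not compiles(patch):
        return 0.0
    return judge_semantics(patch)

print(hybrid_reward("int x = 0; // fixed"))   # passes both stages
print(hybrid_reward("error: broken patch"))   # gated out at stage 1
```

Gating the semantic score behind the compile check is what shifts the training objective from "make it compile" to "make a correct fix": a trivially compilable but wrong patch still earns a low reward from the judge.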