🤖 AI Summary
To address the dual bottlenecks of data scarcity and insufficient semantic correctness in automated C++ compilation error repair, this paper introduces the first large-scale, high-fidelity repair dataset and proposes a reinforcement learning (RL) framework guided by hybrid reward signals. We design a two-stage LLM-as-a-Judge evaluation mechanism that jointly verifies syntactic validity and semantic correctness to ensure patch quality. Furthermore, we integrate large language models (LLMs), RL, and human alignment via a generate-and-verify pipeline, closing the loop between dataset construction and model training. With RL fine-tuning, Qwen2.5-1.5B-Instruct achieves repair performance comparable to that of 14B-class models, substantially improving the practical utility and scalability of small models in real-world development scenarios.
📝 Abstract
The automated repair of C++ compilation errors presents a significant challenge, the resolution of which is critical for developer productivity. Progress in this domain is constrained by two primary factors: the scarcity of large-scale, high-fidelity datasets and the limitations of conventional supervised methods, which often fail to generate semantically correct patches. This paper addresses these gaps by introducing a comprehensive framework with three core contributions. First, we present CCrepair, a novel, large-scale C++ compilation error dataset constructed through a sophisticated generate-and-verify pipeline. Second, we propose a Reinforcement Learning (RL) paradigm guided by a hybrid reward signal, shifting the focus from mere compilability to the semantic quality of the fix. Finally, we establish a robust, two-stage evaluation system that provides this signal, centered on an LLM-as-a-Judge whose reliability has been rigorously validated against the collective judgments of a panel of human experts. This integrated approach aligns the training objective with generating high-quality, non-trivial patches that are both syntactically and semantically correct. We demonstrate the effectiveness of our approach experimentally: our RL-trained Qwen2.5-1.5B-Instruct model achieved performance comparable to a Qwen2.5-14B-Instruct model, validating the efficiency of our training paradigm. Our work provides the research community with a valuable new dataset and a more effective paradigm for training and evaluating robust compilation repair models, paving the way for more practical and reliable automated programming assistants.
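The two-stage hybrid reward described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names, the stubbed compile check, and the stubbed judge call are all hypothetical stand-ins (a real system would invoke an actual C++ compiler and an LLM-as-a-Judge).

```python
# Hedged sketch of a two-stage hybrid reward for compilation repair.
# All names here are illustrative assumptions, not the paper's API.

def compiles(patch: str) -> bool:
    """Stage 1: syntactic gate. In practice this would run a real C++
    compiler (e.g. `g++ -fsyntax-only`) on the patched source; here a
    trivial string check keeps the sketch self-contained."""
    return "error" not in patch

def judge_semantics(patch: str) -> float:
    """Stage 2: semantic quality score in [0, 1]. In the paper this
    role is played by a human-validated LLM-as-a-Judge; a stub stands
    in for that model call here."""
    return 1.0 if "fixed" in patch else 0.2

def hybrid_reward(patch: str) -> float:
    """Non-compiling patches earn zero reward, so the policy cannot
    collect semantic credit for invalid code; compiling patches are
    rewarded by their judged semantic quality rather than by mere
    compilability."""
    if not compiles(patch):
        return 0.0
    return judge_semantics(patch)

print(hybrid_reward("int x = 0; // fixed"))   # passes both stages
print(hybrid_reward("error: broken patch"))   # gated out at stage 1
```

Gating the semantic score behind the compile check is what shifts the training objective from "make it compile" to "make a correct fix": a trivially compilable but wrong patch still earns a low reward from the judge.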