CI-Repair-Bench: A Repository-Aware Benchmark for Automated Patch Validation via CI Workflows

📅 2026-04-29

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing program repair benchmarks do not reflect real repository-level continuous integration (CI) scenarios: they overlook non-code artifacts, environment dependencies, and workflow constraints. This work introduces CI-Repair-Bench, a repository-level repair benchmark built from real GitHub Actions executions, in which a patch counts as correct only if the full original CI workflow passes when re-executed. The benchmark contains 567 CI failures from 103 repositories, categorized into 12 CI error types, which enables error-type-aware evaluation. A reference repair workflow analyzes CI logs, localizes faults, and generates candidate patches with large language models. Automated repair works best on localized, tool-enforced failures such as formatting and linting, while environment, dependency, and configuration failures remain challenging; the best-performing LLM reaches an 18.9% repair success rate.

📝 Abstract

Continuous Integration (CI) enforces repository-level correctness through multi-stage workflows and is central to modern software development, yet diagnosing and repairing CI failures remains challenging. Unlike traditional program repair, CI failures frequently involve non-code artifacts, environment and dependency issues, noisy execution logs, and workflow-level constraints. Existing program repair benchmarks fall short in this setting: they are largely test-centric, restrict repairs to source code, assume fixed execution environments, and evaluate under simplified CI workflows that do not reflect real repository-level validation. We introduce CI-Repair-Bench, a benchmark for CI-verified, repository-level program repair constructed from real GitHub Actions executions. It contains 567 CI failure instances from 103 repositories and evaluates repair correctness exclusively through full CI re-execution under original workflows. Failures are categorized into 12 CI error types, enabling fine-grained, error-type-aware evaluation. To demonstrate benchmark usage, we include a reference CI repair workflow that analyzes CI logs to localize faults and generate candidate patches. Empirical results show that automated repair is most effective for localized, tool-enforced failures such as formatting and linting, while environment, dependency, and configuration-related failures remain challenging; the best-performing LLM achieves an 18.9% repair success rate. CI-Repair-Bench provides a realistic evaluation foundation for advancing research on CI-native automated program repair.
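The error-type-aware evaluation described in the abstract can be illustrated with a minimal sketch, assuming each benchmark instance yields one record holding its error type and whether the full CI re-execution passed after patching. The `RepairOutcome` type, the `success_rates` function, and the toy records are hypothetical names and placeholder data for illustration only; they are not the benchmark's API or its results.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class RepairOutcome:
    """One patched instance (hypothetical record layout, not the benchmark's schema)."""
    instance_id: str
    error_type: str   # one of the CI error categories
    ci_passed: bool   # True only if the full CI re-execution succeeded


def success_rates(outcomes):
    """Return (per-error-type success rate, overall success rate)."""
    totals = defaultdict(int)
    passes = defaultdict(int)
    for outcome in outcomes:
        totals[outcome.error_type] += 1
        if outcome.ci_passed:
            passes[outcome.error_type] += 1
    per_type = {etype: passes[etype] / totals[etype] for etype in totals}
    overall = sum(passes.values()) / len(outcomes) if outcomes else 0.0
    return per_type, overall


# Toy placeholder data, NOT results from the paper.
TOY_OUTCOMES = [
    RepairOutcome("toy-1", "formatting", True),
    RepairOutcome("toy-2", "formatting", True),
    RepairOutcome("toy-3", "dependency", False),
    RepairOutcome("toy-4", "dependency", True),
    RepairOutcome("toy-5", "configuration", False),
]

per_type, overall = success_rates(TOY_OUTCOMES)
for etype, rate in sorted(per_type.items()):
    print(f"{etype}: {rate:.1%}")
print(f"overall: {overall:.1%}")
```

Reporting the per-type rates next to the overall rate keeps a strong result on tool-enforced failures from masking weak results on environment or configuration failures.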

Problem

Research questions and friction points this paper is trying to address.

Continuous Integration

Program Repair

CI Failures

Repository-level Validation

Automated Patch Validation

Innovation

Methods, ideas, or system contributions that make the work stand out.

CI-Aware Program Repair

Repository-Level Validation

Automated Patch Validation