AI Summary
Open-source projects face significant defect-introduction risks due to the overwhelming volume of pull requests (PRs).
Method: We propose DRS-OSS, the first open-source, fine-grained Diff Risk Scoring (DRS) system. DRS-OSS fine-tunes a Llama 3.1 8B sequence classifier on long-context inputs (up to 22k tokens) that combine commit messages, structured diffs, and change metrics. Parameter-efficient adaptation with 4-bit QLoRA, together with DeepSpeed ZeRO-3 CPU offloading, makes training at this context length feasible on a single 20 GB GPU.
Contribution/Results: On the ApacheJIT benchmark, DRS-OSS achieves F1 = 0.64 and ROC-AUC = 0.89, which is state-of-the-art performance. Simulations show that gating only the riskiest 30% of commits can intercept up to 86.4% of defect-inducing changes. DRS-OSS is fully open-sourced and ships with a public API, a web UI, and a GitHub App that posts risk labels on pull requests. It supports PR review prioritization, test planning, and CI/CD gating.
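The gating figure above is a recall-at-top-k measurement: rank changes by predicted risk, gate the top slice, and count how many defect-inducing changes fall inside it. The following is a minimal sketch of that calculation, not the authors' simulation code. The function name `recall_at_top_percent` and the toy scores and labels are invented for illustration and do not come from ApacheJIT.

```python
from typing import Sequence


def recall_at_top_percent(
    scores: Sequence[float], labels: Sequence[int], top_percent: int
) -> float:
    """Share of defect-inducing changes caught when the riskiest
    `top_percent` percent of changes are gated.

    `labels` uses 1 for defect-inducing and 0 for clean changes.
    """
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    if not 0 <= top_percent <= 100:
        raise ValueError("top_percent must be between 0 and 100")

    total_defects = sum(labels)
    if total_defects == 0:
        return 0.0  # nothing to catch

    # Rank changes from highest to lowest predicted risk.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)

    # Number of gated changes, rounded down (integer arithmetic avoids
    # floating-point surprises).
    k = len(ranked) * top_percent // 100

    caught = sum(label for _, label in ranked[:k])
    return caught / total_defects


# Toy data, invented for illustration only.
scores = [0.95, 0.90, 0.80, 0.40, 0.35, 0.30, 0.20, 0.15, 0.10, 0.05]
labels = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
print(recall_at_top_percent(scores, labels, 30))
```

Sweeping `top_percent` from 0 to 100 gives the full trade-off between review effort and defects intercepted, of which the 30% operating point is one sample.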
Abstract
In large-scale open-source projects, hundreds of pull requests land daily, each a potential source of regressions. Diff Risk Scoring (DRS) estimates the likelihood that a diff will introduce a defect, enabling better review prioritization, test planning, and CI/CD gating. We present DRS-OSS, an open-source DRS system equipped with a public API, web UI, and GitHub plugin. DRS-OSS uses a fine-tuned Llama 3.1 8B sequence classifier trained on the ApacheJIT dataset, consuming long-context representations that combine commit messages, structured diffs, and change metrics. Through parameter-efficient adaptation, 4-bit QLoRA, and DeepSpeed ZeRO-3 CPU offloading, we train on 22k-token contexts using a single 20 GB GPU. On the ApacheJIT benchmark, DRS-OSS achieves state-of-the-art performance (F1 = 0.64, ROC-AUC = 0.89). Simulations show that gating only the riskiest 30% of commits can prevent up to 86.4% of defect-inducing changes. The system integrates with developer workflows through an API gateway, a React dashboard, and a GitHub App that posts risk labels on pull requests. We release the full replication package, fine-tuning scripts, deployment artifacts, code, demo video, and public website.
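The abstract says the classifier consumes one long-context input combining the commit message, structured diffs, and change metrics. The sketch below shows one way such an input could be assembled under a token budget; it is not the DRS-OSS format. The section markers (`[MESSAGE]`, `[METRICS]`, `[DIFF]`), the `Change` class, the metric names, and the whitespace-based `count_tokens` stand-in are all assumptions; a real pipeline would count tokens with the model's own tokenizer.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class Change:
    """Hypothetical container for one commit or pull request."""
    message: str
    file_diffs: Dict[str, str]   # file path -> unified diff text
    metrics: Dict[str, float]    # e.g. lines added / deleted


def count_tokens(text: str) -> int:
    # Stand-in for a real tokenizer: counts whitespace-separated words.
    return len(text.split())


def build_model_input(change: Change, max_tokens: int = 22_000) -> str:
    """Concatenate message, metrics, and diffs, dropping diff lines that
    would push the input past `max_tokens`.

    The message and metrics are always kept in full; only the diff
    section is truncated.
    """
    lines = ["[MESSAGE]", change.message.strip(), "[METRICS]"]
    lines += [f"{name}={value}" for name, value in sorted(change.metrics.items())]
    lines.append("[DIFF]")
    used = sum(count_tokens(line) for line in lines)

    for path, diff in change.file_diffs.items():
        for line in [f"FILE {path}"] + diff.splitlines():
            cost = count_tokens(line)
            if used + cost > max_tokens:
                # Budget exhausted: stop and discard the rest of the diff.
                return "\n".join(lines)
            lines.append(line)
            used += cost
    return "\n".join(lines)


# Illustrative change, invented for this example.
change = Change(
    message="Fix NPE in parser",
    file_diffs={"src/Parser.java": "@@ -1,2 +1,3 @@\n-old line\n+new line\n+another line"},
    metrics={"lines_added": 2, "lines_deleted": 1},
)
print(build_model_input(change))
```

Truncating from the end of the diff keeps the message and metrics intact, on the assumption that they are short and informative relative to their token cost. Other policies, such as truncating each file proportionally, are equally plausible.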