DRS-OSS: LLM-Driven Diff Risk Scoring Tool for PR Risk Prediction

📅 2025-11-26

🤖 AI Summary

Large open-source projects receive so many pull requests (PRs) that defect-introducing changes are hard to catch in review. Method: We present DRS-OSS, an open-source Diff Risk Scoring (DRS) system that estimates how likely a diff is to introduce a defect. It uses a Llama 3.1 8B sequence classifier fine-tuned on long-context (22k-token) inputs that combine commit messages, structured diffs, and change metrics. 4-bit QLoRA and DeepSpeed ZeRO-3 CPU offloading make training at this context length feasible on a single 20 GB GPU. Contribution/Results: On the ApacheJIT benchmark, DRS-OSS achieves F1 = 0.64 and ROC-AUC = 0.89, reported as state-of-the-art performance. Simulations show that gating only the riskiest 30% of commits can prevent up to 86.4% of defect-inducing changes. DRS-OSS is fully open-sourced and ships with a public API, a React dashboard, and a GitHub App that posts risk labels on pull requests. It targets PR review prioritization, test planning, and CI/CD gating.

📝 Abstract

In large-scale open-source projects, hundreds of pull requests land daily, each a potential source of regressions. Diff Risk Scoring (DRS) estimates the likelihood that a diff will introduce a defect, enabling better review prioritization, test planning, and CI/CD gating. We present DRS-OSS, an open-source DRS system equipped with a public API, web UI, and GitHub plugin. DRS-OSS uses a fine-tuned Llama 3.1 8B sequence classifier trained on the ApacheJIT dataset, consuming long-context representations that combine commit messages, structured diffs, and change metrics. Through parameter-efficient adaptation, 4-bit QLoRA, and DeepSpeed ZeRO-3 CPU offloading, we train 22k-token contexts on a single 20 GB GPU. On the ApacheJIT benchmark, DRS-OSS achieves state-of-the-art performance (F1 = 0.64, ROC-AUC = 0.89). Simulations show that gating only the riskiest 30% of commits can prevent up to 86.4% of defect-inducing changes. The system integrates with developer workflows through an API gateway, a React dashboard, and a GitHub App that posts risk labels on pull requests. We release the full replication package, fine-tuning scripts, deployment artifacts, code, demo video, and public website.
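
The gating simulation mentioned in the abstract can be illustrated with a minimal sketch: rank changes by predicted risk, gate the top fraction, and measure what share of defect-inducing changes falls inside the gate. The function name `recall_at_top_fraction` and the scores and labels below are hypothetical toy values, not the paper's code or data.

```python
def recall_at_top_fraction(scores, labels, fraction):
    """Share of defect-inducing changes that land in the top `fraction` by risk score.

    scores: predicted risk per change (higher means riskier)
    labels: 1 if the change introduced a defect, else 0
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    total_defects = sum(labels)
    if total_defects == 0:
        return 0.0
    # Rank changes from highest to lowest predicted risk.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    # Number of changes that would be gated (at least one).
    cutoff = max(1, round(len(ranked) * fraction))
    caught = sum(label for _, label in ranked[:cutoff])
    return caught / total_defects


# Synthetic toy data (an assumption for illustration only).
toy_scores = [0.95, 0.90, 0.80, 0.40, 0.35, 0.30, 0.20, 0.15, 0.10, 0.05]
toy_labels = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
print(recall_at_top_fraction(toy_scores, toy_labels, 0.30))
```

On this toy data, gating the top 30% covers three changes, two of which are among the three defect-inducing ones. The 86.4% figure in the abstract comes from the authors' own simulation on ApacheJIT, not from this sketch.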

Problem

Research questions and friction points this paper is trying to address.

Predicts defect risk in pull requests for prioritization

Uses fine-tuned LLM to analyze commit messages and diffs

Integrates with developer workflows via API and GitHub plugin
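
The abstract describes a long-context input that combines a commit message, a structured diff, and change metrics. One way such an input might be assembled is sketched below. The `Change` class, the section markers, the `build_input` helper, and the whitespace-based token count are all assumptions made for illustration; the actual DRS-OSS prompt format and tokenizer handling may differ.

```python
from dataclasses import dataclass


@dataclass
class Change:
    message: str   # commit message
    diff: str      # unified diff text
    metrics: dict  # hypothetical change metrics, e.g. {"lines_added": 12}


def build_input(change, max_tokens=22_000):
    """Concatenate message, metrics, and diff, trimming the diff to a token budget."""
    metric_lines = "\n".join(f"{k}: {v}" for k, v in sorted(change.metrics.items()))
    header = (
        f"[COMMIT MESSAGE]\n{change.message}\n\n"
        f"[CHANGE METRICS]\n{metric_lines}\n\n"
        f"[DIFF]\n"
    )
    # Crude proxy: whitespace-separated words stand in for tokenizer tokens.
    budget = max_tokens - len(header.split())
    kept, used = [], 0
    for line in change.diff.splitlines():
        cost = len(line.split())
        if used + cost > budget:
            # Mark the cut so the truncation is visible in the input.
            kept.append("[DIFF TRUNCATED]")
            break
        kept.append(line)
        used += cost
    return header + "\n".join(kept)


example = Change(
    message="Fix NPE",
    diff="+a b c\n-d e f\n+g",
    metrics={"lines_added": 2, "files_changed": 1},
)
print(build_input(example))
```

Truncating the diff rather than the message or metrics is a design choice of this sketch: the short fields are kept intact and the longest field absorbs the budget cut.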

Innovation

Methods, ideas, or system contributions that make the work stand out.

Fine-tuned Llama 3.1 8B classifier for diff risk prediction

Parameter-efficient training with 4-bit QLoRA and DeepSpeed ZeRO-3 CPU offloading on a single 20 GB GPU

Integration via a public API, a React dashboard, and a GitHub App
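
The GitHub App is described as posting risk labels on pull requests. A minimal sketch of how a classifier probability could be mapped to such a label follows. The label names and the 0.3 / 0.7 thresholds are hypothetical; the source does not state which labels or cut-offs DRS-OSS uses.

```python
def risk_label(probability, medium=0.3, high=0.7):
    """Map a predicted defect probability to a coarse PR label.

    The thresholds and label strings are illustrative assumptions.
    """
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be in [0, 1]")
    if probability >= high:
        return "risk: high"
    if probability >= medium:
        return "risk: medium"
    return "risk: low"


for p in (0.05, 0.45, 0.92):
    print(p, risk_label(p))
```

In practice the thresholds would likely be tuned against the gating budget, for example so that roughly the riskiest 30% of changes receive the highest label.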
