Automating Low-Risk Code Review at Meta: RADAR, Risk Calibration, and Review Efficiency

📅 2026-05-28

🤖 AI Summary

This work addresses the code review bottleneck created by the surge in AI-generated code. The authors present RADAR (Risk Aware Diff Auto Review), a risk-calibrated, multi-stage automated review system deployed at Meta. Each diff passes through a funnel: classification by authorship and source type, eligibility gates, static heuristics, a machine-learned Diff Risk Score, LLM-based Automated Code Review, and deterministic validation. Changes that qualify are then landed automatically. RADAR has reviewed more than 535K diffs and landed more than 331K. Relaxing the Diff Risk Score threshold from the 25th to the 50th percentile raised the approve rate to 60.31%. Compared with non-RADAR diffs, RADAR-reviewed diffs show a revert rate of one-third and a Production Incident rate of 1/50. The authors also report that RADAR reduces median time to close by over 330% and median diff review wall time by 35%.
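The staged funnel described above can be pictured as an ordered list of checks in which a diff stops at the first stage it fails. The Python sketch below is only an illustration under stated assumptions: the `Diff` fields, the stage names, and the rule that a lower risk score means lower risk are all hypothetical stand-ins, not RADAR's actual interfaces or logic.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Diff:
    # All fields are hypothetical stand-ins for signals the real system computes.
    author_kind: str         # e.g. "human" or "agent"
    source_type: str         # e.g. "codemod"
    eligible: bool           # outcome of the eligibility gates
    heuristic_flags: int     # number of static-heuristic warnings
    risk_score: float        # stand-in for the Diff Risk Score; lower = safer (assumed)
    llm_approves: bool       # outcome of the LLM-based review
    validation_passes: bool  # outcome of deterministic validation


Stage = Tuple[str, Callable[[Diff], bool]]


def build_funnel(risk_cutoff: float) -> List[Stage]:
    """Return the ordered stages; each check returns True when the diff may proceed."""
    return [
        # Assumption: classification is modeled as a gate on known author kinds.
        ("classification", lambda d: d.author_kind in {"human", "agent"}),
        ("eligibility", lambda d: d.eligible),
        ("static_heuristics", lambda d: d.heuristic_flags == 0),
        ("risk_score", lambda d: d.risk_score <= risk_cutoff),
        ("llm_review", lambda d: d.llm_approves),
        ("deterministic_validation", lambda d: d.validation_passes),
    ]


def first_failing_stage(diff: Diff, funnel: List[Stage]) -> Optional[str]:
    """Return the name of the first stage the diff fails, or None if it qualifies."""
    for name, check in funnel:
        if not check(diff):
            return name
    return None
```

In this sketch, a diff for which `first_failing_stage` returns `None` would be a candidate for automatic landing, and any other result would route it back to human review, with the stage name recording why.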

📝 Abstract

AI-assisted coding tools have altered software production. At Meta, significant lines of code per human-landed diff grew by 105.9% year over year and per-developer diff volume rose 51%, with agentic AI responsible for over 80% of that growth. Meanwhile, the share of diffs receiving timely review has declined, exposing a widening gap between code supply and reviewer bandwidth. We ask three questions that progress from feasibility through calibration to impact: (1) can risk-stratified automation operate at scale across diverse organizations, (2) how does tuning the risk threshold affect the trade-off between automation yield and safety, and (3) to what extent does automated review reduce end-to-end latency for AI-generated changes? We deployed RADAR (Risk Aware Diff Auto Review), a multi-stage funnel that classifies each diff by authorship and source type, applies eligibility gates, static heuristics, a machine-learned Diff Risk Score, LLM-based Automated Code Review, and deterministic validation before landing qualifying changes. We evaluate RADAR through telemetry covering 535K+ RADAR-reviewed diffs, observational before-after comparisons for policy changes, and difference-in-differences analysis of efficiency outcomes. RADAR has reviewed 535K+ diffs and landed 331K+. Relaxing the Diff Risk Score threshold from the 25th to the 50th percentile increased the approve rate to 60.31%. The revert rate for RADAR-reviewed diffs is 1/3 that of non-RADAR diffs, and the Production Incident rate is 1/50 that of non-RADAR diffs. RADAR reduces median time to close by over 330% and median diff review wall time by 35%. Risk-aware layered automation can materially reduce review bottlenecks created by AI-driven code growth without compromising production safety.
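To make the percentile threshold concrete, the sketch below computes a nearest-rank percentile cutoff over a set of scores and the share of diffs at or below it. It assumes that lower scores mean lower risk and that the threshold is a percentile of the score distribution. The function names and the sample scores are hypothetical, and the resulting share is only the fraction passing the risk gate, not the reported approve rate, which also depends on the other stages.

```python
import math
from typing import Sequence


def percentile_cutoff(scores: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest score with at least pct% of scores <= it."""
    if not scores:
        raise ValueError("scores must be non-empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(scores)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


def share_passing_risk_gate(scores: Sequence[float], pct: float) -> float:
    """Fraction of diffs whose score is at or below the pct-th percentile cutoff."""
    cutoff = percentile_cutoff(scores, pct)
    return sum(s <= cutoff for s in scores) / len(scores)


# Hypothetical score sample: 100 evenly spaced values from 0.01 to 1.00.
sample_scores = [i / 100 for i in range(1, 101)]
share_at_25 = share_passing_risk_gate(sample_scores, 25)
share_at_50 = share_passing_risk_gate(sample_scores, 50)
```

On this toy sample, moving the cutoff from the 25th to the 50th percentile doubles the share of diffs that clear the risk gate, which is the kind of yield-versus-safety lever the second research question examines.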

Problem

Research questions and friction points this paper is trying to address.

code review

AI-generated code

review bottleneck

risk stratification

automation

Innovation

Methods, ideas, or system contributions that make the work stand out.

Risk-aware automation

Automated code review

Diff Risk Score

LLM-based code analysis

Review efficiency
