Automating Low-Risk Code Review at Meta: RADAR, Risk Calibration, and Review Efficiency

📅 2026-05-28
🤖 AI Summary
This work addresses the code review bottleneck and rising latency caused by the surge of AI-generated code submissions. The authors propose RADAR (Risk Aware Diff Auto Review), a risk-calibrated, multi-stage automated review system deployed at industrial scale at Meta. RADAR's funnel comprises authorship and source-type classification, eligibility gates, static heuristics, a machine-learned Diff Risk Score, LLM-based automated code review, and deterministic validation, with a tunable risk threshold that trades automation yield against safety. Across more than 535,000 reviewed diffs, RADAR landed more than 331,000. Relaxing the Diff Risk Score threshold from the 25th to the 50th percentile raised the approve rate to 60.31%. RADAR-reviewed diffs had one-third the revert rate and 1/50 the production incident rate of non-RADAR diffs. The authors report that median time to close fell by over 330% and median diff review wall time by 35%.
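The staged funnel described in the summary can be pictured as a chain of gates, where the first gate that objects sends the diff back to human review. The sketch below is only an illustration under stated assumptions: the `Diff` fields, the label values, the 200-line heuristic, and the default threshold of 0.3 are all hypothetical and do not come from the paper or from Meta's systems.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Diff:
    """Hypothetical, simplified view of a code change (not Meta's schema)."""
    author_kind: str     # assumed labels: "human" or "agent"
    source_type: str     # assumed labels: "codemod", "config", ...
    lines_changed: int
    risk_score: float    # stand-in for the Diff Risk Score, in [0, 1]
    llm_approved: bool   # stand-in for the LLM review verdict
    tests_pass: bool     # stand-in for deterministic validation


# A stage returns None to pass the diff on, or a reason string to stop.
Stage = Callable[[Diff], Optional[str]]


def make_funnel(risk_threshold: float) -> List[Tuple[str, Stage]]:
    """Build the ordered list of (stage name, check) pairs."""
    return [
        ("classification",
         lambda d: None if d.author_kind in {"human", "agent"} else "unknown author kind"),
        ("eligibility",
         lambda d: None if d.source_type in {"codemod", "config"} else "source type not eligible"),
        ("static heuristics",
         lambda d: None if d.lines_changed <= 200 else "diff too large"),
        ("risk score",
         lambda d: None if d.risk_score <= risk_threshold else "risk above threshold"),
        ("llm review",
         lambda d: None if d.llm_approved else "LLM review raised concerns"),
        ("validation",
         lambda d: None if d.tests_pass else "deterministic validation failed"),
    ]


def review(diff: Diff, risk_threshold: float = 0.3) -> Tuple[bool, str]:
    """Run stages in order; the first failing stage routes the diff to humans."""
    for name, stage in make_funnel(risk_threshold):
        reason = stage(diff)
        if reason is not None:
            return False, f"{name}: {reason}"
    return True, "auto-land"
```

Ordering the cheap checks first means the more expensive stages (the LLM review and validation stand-ins here) only run on diffs that have already cleared the earlier gates, which is one plausible reading of why the system is described as a funnel.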
📝 Abstract
AI-assisted coding tools have altered software production. At Meta, significant lines of code per human-landed diff grew by 105.9% year over year and per-developer diff volume rose 51%, with agentic AI responsible for over 80% of that growth. Meanwhile, the share of diffs receiving timely review has declined, exposing a widening gap between code supply and reviewer bandwidth. We ask three questions that progress from feasibility through calibration to impact: (1) can risk-stratified automation operate at scale across diverse organizations, (2) how does tuning the risk threshold affect the trade-off between automation yield and safety, and (3) to what extent does automated review reduce end-to-end latency for AI-generated changes? We deployed RADAR (Risk Aware Diff Auto Review), a multi-stage funnel that classifies each diff by authorship and source type, applies eligibility gates, static heuristics, a machine-learned Diff Risk Score, LLM-based Automated Code Review, and deterministic validation before landing qualifying changes. We evaluate RADAR through telemetry covering 535K+ RADAR-reviewed diffs, observational before-after comparisons for policy changes, and difference-in-differences analysis of efficiency outcomes. RADAR has reviewed 535K+ diffs and landed 331K+. Relaxing the Diff Risk Score threshold from the 25th to the 50th percentile increased the approve rate to 60.31%. The revert rate for RADAR-reviewed diffs is 1/3 that of non-RADAR diffs, and the Production Incident rate is 1/50 that of non-RADAR diffs. RADAR reduces median time to close by over 330% and median diff review wall time by 35%. Risk-aware layered automation can materially reduce review bottlenecks created by AI-driven code growth without compromising production safety.
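To make the percentile-based threshold concrete, the sketch below computes a score cutoff at the 25th and 50th percentile of a score distribution and shows how the share of diffs passing the risk gate changes. It is a minimal illustration, not the paper's method: the nearest-rank percentile definition, the function names, and the `history` values are assumptions made up for this example. The share passing the risk gate is also not the same as the reported 60.31% approve rate, which depends on every other stage of the funnel.

```python
import math
from typing import Sequence


def percentile_cutoff(scores: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: smallest score with at least pct% of scores at or below it."""
    if not scores or not 0 < pct <= 100:
        raise ValueError("need a non-empty score list and 0 < pct <= 100")
    ordered = sorted(scores)
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[rank - 1]


def eligible_share(scores: Sequence[float], cutoff: float) -> float:
    """Fraction of diffs whose risk score is at or below the cutoff."""
    return sum(s <= cutoff for s in scores) / len(scores)


# Made-up historical risk scores, for illustration only.
history = [0.02, 0.05, 0.08, 0.11, 0.15, 0.22, 0.30, 0.41, 0.57, 0.80]

for pct in (25, 50):
    cutoff = percentile_cutoff(history, pct)
    print(f"p{pct}: cutoff={cutoff}, share passing risk gate={eligible_share(history, cutoff)}")
```

Expressing the threshold as a percentile rather than a raw score keeps the gate meaningful if the score distribution shifts, at the cost of having to recompute the cutoff from recent data.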
Problem

Research questions and friction points this paper is trying to address.

code review
AI-generated code
review bottleneck
risk stratification
automation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Risk-aware automation
Automated code review
Diff Risk Score
LLM-based code analysis
Review efficiency
Chris Adams
Congressional Budget Office
auctions, empirical industrial organization, drug development, econometrics, Roy model
Arjun Singh Banga
Meta
Parveen Bansal
Meta
Souvik Bhattacharya
Meta
Rujin Cao
Meta
Pedro Canahuati
Meta
Nate Cook
Meta
Brian Ellis
Meta
Prabhakar Goyal
Meta
Gurinder Grewal
Meta
Tianyu He
Microsoft Research
machine learning, generative models, world models
Matt Labunka
Meta
Alex Manners
Meta
David Molnar
Meta Platforms
Security, program analysis, AI
Ging Cee Ng
Meta
Vishal Parekh
Meta
Jiefu Pei
Meta
Frederic Sagnes
Meta
James Saindon
Meta
Will Shackleton
Meta
Sid Sidhu
Meta
Gursharan Singh
Meta
Karthik Chengayan Sridhar
Meta
Matt Steiner
Meta
Pratibha Udmalpet
Meta