🤖 AI Summary
This work addresses the code review bottleneck and rising review latency caused by the surge in AI-generated code. The authors present RADAR (Risk Aware Diff Auto Review), a risk-calibrated, multi-stage automated review system deployed at industrial scale at Meta. RADAR is a funnel that classifies each diff by authorship and source type, then applies eligibility gates, static heuristics, a machine-learned Diff Risk Score, LLM-based Automated Code Review, and deterministic validation before landing qualifying changes. Its risk threshold can be tuned to trade automation yield against safety. RADAR has reviewed more than 535,000 diffs and landed more than 331,000. Relaxing the Diff Risk Score threshold from the 25th to the 50th percentile raised the approve rate to 60.31%. RADAR-reviewed diffs show a revert rate one-third that of non-RADAR diffs and a Production Incident rate 1/50 that of non-RADAR diffs. The authors report that median time to close fell by over 330% and median diff review wall time fell by 35%.
📝 Abstract
AI-assisted coding tools have altered software production. At Meta, significant lines of code per human-landed diff grew by 105.9% year over year and per-developer diff volume rose 51%, with agentic AI responsible for over 80% of that growth. Meanwhile, the share of diffs receiving timely review has declined, exposing a widening gap between code supply and reviewer bandwidth. We ask three questions that progress from feasibility through calibration to impact: (1) can risk-stratified automation operate at scale across diverse organizations, (2) how does tuning the risk threshold affect the trade-off between automation yield and safety, and (3) to what extent does automated review reduce end-to-end latency for AI-generated changes? We deployed RADAR (Risk Aware Diff Auto Review), a multi-stage funnel that classifies each diff by authorship and source type, applies eligibility gates, static heuristics, a machine-learned Diff Risk Score, LLM-based Automated Code Review, and deterministic validation before landing qualifying changes. We evaluate RADAR through telemetry covering 535K+ RADAR-reviewed diffs, observational before-after comparisons for policy changes, and difference-in-differences analysis of efficiency outcomes. RADAR has reviewed 535K+ diffs and landed 331K+. Relaxing the Diff Risk Score threshold from the 25th to the 50th percentile increased the approve rate to 60.31%. The revert rate for RADAR-reviewed diffs is 1/3 that of non-RADAR diffs, and the Production Incident rate is 1/50 that of non-RADAR diffs. RADAR reduces median time to close by over 330% and median diff review wall time by 35%. Risk-aware layered automation can materially reduce review bottlenecks created by AI-driven code growth without compromising production safety.
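As a rough illustration of the funnel described in the abstract, the sketch below runs a diff through the stages in order and stops at the first stage that rejects it. It is not RADAR's implementation. The `Diff` fields, the `ALLOWED_SOURCES` set, the stage names, and the nearest-rank `percentile_cutoff` helper are all assumptions made for this example. The abstract also does not say how the authorship/source classification gates a diff, so treating it as an allow-list is a guess.

```python
import math
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical allow-list; the abstract does not say which sources qualify.
ALLOWED_SOURCES = {"agentic_ai"}


@dataclass
class Diff:
    """Illustrative diff record; every field is an assumption."""
    diff_id: str
    source: str           # result of authorship / source-type classification
    eligible: bool        # stand-in for the eligibility gates
    heuristics_ok: bool   # stand-in for the static heuristics
    risk_score: float     # stand-in for a Diff Risk Score (higher = riskier)
    llm_approved: bool    # stand-in for the LLM-based review verdict
    validation_ok: bool   # stand-in for deterministic validation


def percentile_cutoff(scores: List[float], percentile: float) -> float:
    """Nearest-rank percentile of historical risk scores (assumed method)."""
    ordered = sorted(scores)
    rank = max(1, math.ceil(percentile / 100 * len(ordered)))
    return ordered[rank - 1]


def review(diff: Diff, cutoff: float) -> Optional[str]:
    """Return the name of the first rejecting stage, or None if the diff
    passes every stage and would qualify for landing."""
    stages = [
        ("source", lambda d: d.source in ALLOWED_SOURCES),
        ("eligibility", lambda d: d.eligible),
        ("heuristics", lambda d: d.heuristics_ok),
        ("risk_score", lambda d: d.risk_score <= cutoff),
        ("llm_review", lambda d: d.llm_approved),
        ("validation", lambda d: d.validation_ok),
    ]
    for name, passes in stages:
        if not passes(diff):
            return name
    return None


# Made-up historical scores, used only to derive example cutoffs.
history = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
strict = percentile_cutoff(history, 25)
relaxed = percentile_cutoff(history, 50)

candidate = Diff("D1", "agentic_ai", True, True, 0.3, True, True)
strict_result = review(candidate, strict)
relaxed_result = review(candidate, relaxed)
```

With these made-up numbers, the same diff is stopped at the risk stage under the 25th-percentile cutoff and passes every stage under the 50th-percentile cutoff. This is the mechanism by which relaxing the threshold can raise the approve rate.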