AI Summary
This work addresses the cost and inefficiency of the traditional dual-redundancy (2x) disaster recovery model, which achieves high availability only by leaving half of the provisioned capacity idle. The authors propose a criticality-aware, tiered failover architecture that uses dependency analysis to identify core services and allocates recovery resources differentially. Critical services retain full failover redundancy. Non-critical services run on the failover buffer capacity during steady state, are preempted during a failover, and are then restored on on-demand capacity in an order set by their SLAs. Automated safeguards, including dependency analysis and regression gates, keep critical services functioning while non-critical services are unavailable. The reported results are a reduction in steady-state provisioning from 2x to 1.3x, an increase in resource utilization from about 20% to about 30%, sustained availability of 99.97%, more than 4,000 unsafe dependencies hardened, and more than one million CPU cores saved.
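The dependency analysis mentioned above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that an "unsafe dependency" means a hard call from a service needed by critical traffic into a service that may be preempted during failover, and the service names, the `hard_deps` graph, and both function names are hypothetical.

```python
from collections import deque


def critical_closure(hard_deps, seeds):
    """Return every service that the seed services need, directly or transitively.

    hard_deps maps a service name to the services it cannot run without
    (a hypothetical representation, not taken from the paper).
    """
    seen = set(seeds)
    queue = deque(seeds)
    while queue:
        svc = queue.popleft()
        for dep in hard_deps.get(svc, ()):
            if dep not in seen:
                seen.add(dep)
                queue.append(dep)
    return seen


def unsafe_edges(hard_deps, seeds, preemptible):
    """List (caller, callee) pairs where a needed service calls a preemptible one."""
    core = critical_closure(hard_deps, seeds)
    return sorted(
        (svc, dep)
        for svc in core
        for dep in hard_deps.get(svc, ())
        if dep in preemptible
    )


# Hypothetical example graph.
hard_deps = {
    "trip-dispatch": ["pricing", "promo-banner"],
    "pricing": ["geo"],
    "promo-banner": ["ads-store"],
}
edges = unsafe_edges(hard_deps, {"trip-dispatch"}, {"promo-banner", "ads-store"})
print(edges)
```

Under these assumptions, each flagged edge would need to be hardened (for example, made optional or given a fallback) before the callee can safely be preempted.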
Abstract
Operating a global, real-time platform at Uber's scale requires infrastructure that is both resilient and cost-efficient. Historically, reliability was ensured through a costly 2x capacity model (each service provisioned to handle global traffic independently across two regions), leaving half the fleet idle. We present Uber's Failover Architecture (UFA), which replaces the uniform 2x model with a differentiated architecture aligned to business criticality. Critical services retain failover guarantees, while non-critical services opportunistically use failover buffer capacity reserved for critical services during steady state. During rare "full-peak" failovers, non-critical services are selectively preempted and then rapidly restored on on-demand capacity, under differentiated Service-Level Agreements (SLAs). Automated safeguards, including dependency analysis and regression gates, ensure critical services continue to function even while non-critical services are unavailable. The quantitative impact is significant: UFA reduces steady-state provisioning from 2x to 1.3x, raising utilization from ~20% to ~30% while sustaining 99.97% availability. To date, UFA has hardened over 4,000 unsafe dependencies and eliminated over one million CPU cores from a baseline of about four million cores.
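The reported figures can be cross-checked with simple arithmetic. The sketch below is illustrative only: the abstract does not say how the numbers were derived, so it assumes the workload stays constant, fleet size scales linearly with the provisioning factor, and the four-million-core baseline corresponds to the 2x model. The function names are hypothetical.

```python
def utilization_after(util_before, factor_before, factor_after):
    """Utilization when the same work runs on a proportionally smaller fleet."""
    return util_before * factor_before / factor_after


def cores_saved(baseline_cores, factor_before, factor_after):
    """Cores removed when provisioning shrinks from factor_before to factor_after."""
    return baseline_cores * (1 - factor_after / factor_before)


# Inputs taken from the abstract; the scaling model is an assumption.
util = utilization_after(0.20, 2.0, 1.3)
saved = cores_saved(4_000_000, 2.0, 1.3)
print(f"utilization after: {util:.1%}")
print(f"cores saved: {saved:,.0f}")
```

Under these assumptions the model gives roughly 31% utilization and about 1.4 million cores saved, which is consistent with the stated "~30%" and "over one million".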