Uber's Failover Architecture: Reconciling Reliability and Efficiency in Hyperscale Microservice Infrastructure

📅 2026-03-07

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the inefficiency and high cost of traditional dual-redundancy disaster recovery models, which struggle to balance high availability with resource efficiency. The authors propose a criticality-aware, tiered disaster recovery architecture that automatically identifies core services through dependency analysis and allocates recovery resources differentially: critical services retain redundancy for guaranteed resilience, while non-critical services share buffer resources during steady state and are restored according to SLA-based priorities during failures. Combined with regression guardrails, dynamic preemption, and on-demand scaling, the system enables automated governance. Reported results show that the approach reduces steady-state resource provisioning from 2x to 1.3x, raises resource utilization from roughly 20% to roughly 30%, maintains 99.97% availability, hardens over 4,000 unsafe dependencies, and eliminates more than one million CPU cores from a baseline of about four million.
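The preempt-then-restore flow described above can be illustrated with a minimal Python sketch. It is not Uber's implementation: the `Service` fields, the `plan_failover` function, and the policy of preempting the loosest-SLA services first are all assumptions made for illustration, since the summary only states that preemption is selective and that restoration follows SLA-based priorities.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Service:
    name: str
    critical: bool
    cores: int
    restore_sla_minutes: int  # hypothetical per-service restoration SLA


def plan_failover(services, cores_needed):
    """Pick non-critical services to preempt and the order to restore them.

    Assumption: services with the loosest restoration SLA are preempted
    first, and preempted services are restored tightest SLA first.
    """
    borrowers = [s for s in services if not s.critical]
    preempted, freed = [], 0
    for svc in sorted(borrowers, key=lambda s: s.restore_sla_minutes, reverse=True):
        if freed >= cores_needed:
            break
        preempted.append(svc)
        freed += svc.cores
    if freed < cores_needed:
        raise ValueError("non-critical services cannot free enough cores")
    restore_order = sorted(preempted, key=lambda s: s.restore_sla_minutes)
    return preempted, restore_order


# Hypothetical fleet: one critical service and three buffer borrowers.
fleet = [
    Service("payments", True, 100, 0),
    Service("batch-reports", False, 60, 240),
    Service("recommendations", False, 50, 30),
    Service("analytics", False, 40, 120),
]
preempted, restore_order = plan_failover(fleet, cores_needed=100)
print([s.name for s in preempted])
print([s.name for s in restore_order])
```

In this toy fleet, the critical service never appears in the preemption list, and the non-critical service with the tightest SLA keeps running because the looser-SLA services already free enough cores.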

📝 Abstract

Operating a global, real-time platform at Uber's scale requires infrastructure that is both resilient and cost-efficient. Historically, reliability was ensured through a costly 2x capacity model, with each service provisioned to handle global traffic independently across two regions, leaving half the fleet idle. We present Uber's Failover Architecture (UFA), which replaces the uniform 2x model with a differentiated architecture aligned to business criticality. Critical services retain failover guarantees, while non-critical services opportunistically use the failover buffer capacity reserved for critical services during steady state. During rare "full-peak" failovers, non-critical services are selectively preempted and then rapidly restored on on-demand capacity under differentiated Service-Level Agreements (SLAs). Automated safeguards, including dependency analysis and regression gates, ensure critical services continue to function even while non-critical services are unavailable. The quantitative impact is significant: UFA reduces steady-state provisioning from 2x to 1.3x, raising utilization from ~20% to ~30% while sustaining 99.97% availability. To date, UFA has hardened over 4,000 unsafe dependencies and eliminated over one million CPU cores from a baseline of about four million cores.
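As a quick consistency check on the reported figures, the short Python sketch below recomputes them. It assumes the served workload stays constant, so that utilization scales inversely with provisioned capacity; the function name `utilization_after` is illustrative and not taken from the paper.

```python
def utilization_after(util_before, prov_before, prov_after):
    """Utilization after a provisioning change, assuming constant workload."""
    return util_before * prov_before / prov_after


# Figures quoted in the abstract.
new_util = utilization_after(0.20, 2.0, 1.3)  # about 0.308, in line with ~30%
capacity_cut = 1 - 1.3 / 2.0                  # 0.35 less capacity per unit of peak load
cores_cut = 1_000_000 / 4_000_000             # 0.25 of the four-million-core baseline

print(f"utilization: {new_util:.1%}")
print(f"provisioning reduction: {capacity_cut:.0%}")
print(f"share of baseline cores eliminated: {cores_cut:.0%}")
```

Under that assumption, moving from 2x to 1.3x lifts a 20% utilization to about 30.8%, which matches the reported ~30%. The 35% provisioning reduction and the 25% share of cores eliminated need not coincide, since the abstract does not state that the entire four-million-core baseline was provisioned at 2x.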

Problem

Research questions and friction points this paper is trying to address.

failover

microservice

reliability

resource efficiency

hyperscale

Innovation

Methods, ideas, or system contributions that make the work stand out.

Failover Architecture

Differentiated SLAs

Resource Utilization