Uber's Failover Architecture: Reconciling Reliability and Efficiency in Hyperscale Microservice Infrastructure

📅 2026-03-07
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This work addresses the inefficiency and high cost of the traditional 2x (dual-region) redundancy model for disaster recovery, which struggles to balance high availability with resource efficiency. The authors propose a criticality-aware, tiered disaster recovery architecture that identifies critical services automatically through dependency analysis and allocates recovery resources differentially: critical services retain redundancy for guaranteed resilience, while non-critical services share the failover buffer during steady state and are restored according to SLA-based priorities during failures. Regression guardrails, dynamic preemption, and on-demand scaling automate governance of the system. Reported results show that the approach reduces steady-state resource provisioning from 2x to 1.3x, raises resource utilization from roughly 20% to roughly 30%, maintains 99.97% availability, hardens over 4,000 unsafe dependencies, and saves more than one million CPU cores.
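The tiering described above can be illustrated with a small sketch. It is not the authors' implementation: the `Service` fields, the `plan_failover` helper, the service names, and the core counts are all hypothetical, and the policy (keep critical services, let non-critical services use whatever buffer remains, restore the rest in order of restore SLA) is a simplified reading of the summary.

```python
from dataclasses import dataclass


@dataclass
class Service:
    name: str
    critical: bool
    cores: int
    restore_sla_minutes: int = 0  # hypothetical field; only meaningful for non-critical services


def plan_failover(services, surviving_capacity):
    """Split services into (kept, preempted) for a single surviving region.

    Critical services are always kept. Non-critical services are considered
    tightest restore SLA first and kept only while spare capacity remains;
    the rest are preempted, listed in the order they should be restored.
    """
    critical = [s for s in services if s.critical]
    needed = sum(s.cores for s in critical)
    if needed > surviving_capacity:
        raise ValueError("critical tier does not fit in the surviving region")

    free = surviving_capacity - needed
    kept = list(critical)
    preempted = []
    non_critical = sorted(
        (s for s in services if not s.critical),
        key=lambda s: s.restore_sla_minutes,
    )
    for s in non_critical:
        if s.cores <= free:
            kept.append(s)
            free -= s.cores
        else:
            preempted.append(s)
    return kept, preempted


# Made-up fleet for illustration only.
fleet = [
    Service("payments", critical=True, cores=60),
    Service("trip-dispatch", critical=True, cores=30),
    Service("batch-reports", critical=False, cores=20, restore_sla_minutes=240),
    Service("recommendations", critical=False, cores=15, restore_sla_minutes=60),
]
kept, preempted = plan_failover(fleet, surviving_capacity=110)
kept_names = [s.name for s in kept]
preempted_names = [s.name for s in preempted]
print("kept:", kept_names)
print("preempted (restore order):", preempted_names)
```

In this toy fleet the two critical services consume 90 of the 110 surviving cores, so only the smaller non-critical service fits and the other is preempted until on-demand capacity arrives.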

📝 Abstract
Operating a global, real-time platform at Uber's scale requires infrastructure that is both resilient and cost-efficient. Historically, reliability was ensured through a costly 2x capacity model (each service provisioned to handle global traffic independently across two regions), leaving half the fleet idle. We present Uber's Failover Architecture (UFA), which replaces the uniform 2x model with a differentiated architecture aligned to business criticality. Critical services retain failover guarantees, while non-critical services opportunistically use failover buffer capacity reserved for critical services during steady state. During rare "full-peak" failovers, non-critical services are selectively preempted and rapidly restored, with differentiated Service-Level Agreements (SLAs), using on-demand capacity. Automated safeguards, including dependency analysis and regression gates, ensure critical services continue to function even while non-critical services are unavailable. The quantitative impact is significant: UFA reduces steady-state provisioning from 2x to 1.3x, raising utilization from ~20% to ~30% while sustaining 99.97% availability. To date, UFA has hardened over 4,000 unsafe dependencies and eliminated over one million CPU cores from a baseline of about four million cores.
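The figures in the abstract can be sanity-checked with a short calculation. This is a back-of-the-envelope sketch, not the paper's methodology: it assumes demand stays constant, that utilization is simply demand divided by provisioned capacity, and that the roughly four-million-core baseline corresponds to the 2x fleet. The function names are illustrative.

```python
# Back-of-the-envelope check of the abstract's numbers.
# Assumptions (not stated in the abstract): demand is constant, utilization is
# demand / provisioned capacity, and the ~4M-core baseline is the 2x fleet.

def scaled_utilization(old_util, old_factor, new_factor):
    """Utilization after changing the provisioning factor at constant demand."""
    return old_util * old_factor / new_factor


def cores_saved(baseline_cores, old_factor, new_factor):
    """Cores freed if the fleet shrinks in proportion to the provisioning factor."""
    return baseline_cores * (1 - new_factor / old_factor)


new_util = scaled_utilization(0.20, old_factor=2.0, new_factor=1.3)
saved = cores_saved(4_000_000, old_factor=2.0, new_factor=1.3)
print(f"utilization after UFA: {new_util:.1%}")
print(f"cores saved (upper bound under these assumptions): {saved:,.0f}")
```

Under these assumptions, moving from 2x to 1.3x gives about 30.8% utilization and a ceiling of about 1.4 million cores, which is consistent with the reported ~30% and "over one million" figures.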
Problem

Research questions and friction points this paper is trying to address.

failover
microservice
reliability
resource efficiency
hyperscale
Innovation

Methods, ideas, or system contributions that make the work stand out.

Failover Architecture
Differentiated SLAs
Resource Utilization
Microservice Resilience
Capacity Optimization
Authors

Mayank Bansal
Gen AI Leader @ AWS AI Labs, Ex-AmazonGo/JWO, Ex-Waymo/Google-Self-Driving, Ex-Sarnoff/SRI
Agents, AGI, GenAI, Computer Vision, Robotics
Milind Chabbi
Uber Technologies
Parallel Computing, Compilers, and Performance Analysis
Kenneth Bøgh
Uber Technologies
Srikanth Prodduturi
Uber Technologies
Kevin Xu
Uber Technologies
Amit Kumar
Uber Technologies
David Bell
Uber Technologies
Ranjib Dey
Uber Technologies
Yufei Ren
Uber Technologies
Sachin Sharma
Uber Technologies
Juan Marcano
Uber Technologies
Shriniket Kale
Uber Technologies
Subhav Pradhan
Uber Technologies Inc., Vanderbilt University (Ph.D.)
Distributed Computing, Resilient Systems, Deployment and Configuration
Ivan Beschastnikh
University of British Columbia
Distributed Systems, Software Engineering, Systems
Miguel Covarrubias
Uber Technologies
Chien-Chih Liao
Uber Technologies
Sandeep Koushik Sheshadri
Uber Technologies
Wen Luo
Peking University
Kai Song
TikTok Inc.
NLP & LLM
Ashish Samant
Uber Technologies
Sahil Rihan
Uber Technologies
Nimish Sheth
Uber Technologies
Uday Kiran Medisetty
Uber Technologies