Beyond Optimal Fault Tolerance

📅 2025-01-10

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This paper studies *recoverable fault tolerance* for state machine replication (SMR): how far resilience can be pushed beyond the classical bound $\alpha + 2\beta \leq 1$ when a protocol is allowed at most $r$ consistency violations. Method: The paper defines recoverable fault tolerance and proves matching upper and lower bounds for each $r$. Because bounding rollback is impossible without additional timing assumptions, it works in the partially synchronous model with an extra parameter $\Delta^*$ that bounds message delays around the time of an attack. A generic recovery procedure, which can be grafted onto any accountable SMR protocol, restores consistency after a violation and rolls back only transactions finalized in the previous $2\Delta^*$ timesteps. Results: For protocols that guarantee liveness against $\frac{1}{3}$-bounded adversaries, a $\frac{5}{9}$-bounded adversary can cause one consistency violation but not two, and a $\frac{2}{3}$-bounded adversary can cause two but not three; consistency is restored after recovery. The resulting fault tolerance is a non-constant function of $r$.
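The arithmetic behind the classical bound can be checked directly. The sketch below is illustrative only: the function names `classical_bound_holds` and `max_classical_alpha` are hypothetical, and the code tests just the inequality $\alpha + 2\beta \leq 1$ against the thresholds quoted in the abstract's example, where liveness holds against 1/3-bounded adversaries. It does not compute the paper's recoverable fault tolerance for general $r$.

```python
from fractions import Fraction


def classical_bound_holds(alpha: Fraction, beta: Fraction) -> bool:
    """Return True if (alpha, beta) satisfies the classical bound alpha + 2*beta <= 1."""
    return alpha + 2 * beta <= 1


def max_classical_alpha(beta: Fraction) -> Fraction:
    """Largest consistency threshold alpha allowed by the classical bound for a given beta."""
    return 1 - 2 * beta


# Liveness threshold taken from the abstract's example.
beta = Fraction(1, 3)
print("max classical alpha:", max_classical_alpha(beta))

# 1/3 sits exactly on the classical bound; 5/9 and 2/3 are the thresholds
# the abstract associates with one and two tolerated violations.
for alpha in (Fraction(1, 3), Fraction(5, 9), Fraction(2, 3)):
    print(alpha, "within classical bound:", classical_bound_holds(alpha, beta))
```

With $\beta = 1/3$ the classical bound caps $\alpha$ at 1/3, so both 5/9 and 2/3 lie outside it; those are the regimes in which the paper trades a bounded number of violations for extra resilience.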

📝 Abstract

The optimal fault-tolerance achievable by any protocol has been characterized in a wide range of settings. For example, for state machine replication (SMR) protocols operating in the partially synchronous setting, it is possible to simultaneously guarantee consistency against $\alpha$-bounded adversaries (i.e., adversaries that control less than an $\alpha$ fraction of the participants) and liveness against $\beta$-bounded adversaries if and only if $\alpha + 2\beta \leq 1$. This paper characterizes to what extent "better-than-optimal" fault-tolerance guarantees are possible for SMR protocols when the standard consistency requirement is relaxed to allow a bounded number $r$ of consistency violations. We prove that bounding rollback is impossible without additional timing assumptions and investigate protocols that tolerate and recover from consistency violations whenever message delays around the time of an attack are bounded by a parameter $\Delta^*$ (which may be arbitrarily larger than the parameter $\Delta$ that bounds post-GST message delays in the partially synchronous model). Here, a protocol's fault-tolerance can be a non-constant function of $r$, and we prove, for each $r$, matching upper and lower bounds on the optimal "recoverable fault-tolerance" achievable by any SMR protocol. For example, for protocols that guarantee liveness against 1/3-bounded adversaries in the partially synchronous setting, a 5/9-bounded adversary can always cause one consistency violation but not two, and a 2/3-bounded adversary can always cause two consistency violations but not three. Our positive results are achieved through a generic "recovery procedure" that can be grafted on to any accountable SMR protocol and restores consistency following a violation while rolling back only transactions that were finalized in the previous $2\Delta^*$ timesteps.
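The rollback guarantee can be pictured as a time window. The following is a minimal sketch under stated assumptions, not the paper's recovery procedure: the `FinalizedTx` type, the `split_by_rollback_window` helper, the integer timesteps, the inclusive window boundary, and the choice of measuring the window back from a single `recovery_time` are all hypothetical. It only separates transactions that fall outside the $2\Delta^*$ window (never rolled back under the stated guarantee) from those inside it (the only ones that may be rolled back).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FinalizedTx:
    tx_id: str
    finalized_at: int  # timestep at which the transaction was finalized (assumed integer)


def split_by_rollback_window(txs, recovery_time, delta_star):
    """Split transactions into (safe, at_risk) relative to a 2 * delta_star window.

    Assumption: the window is measured back from `recovery_time` and its
    lower boundary is inclusive. Transactions finalized before the window
    are never rolled back; those inside it are only candidates for rollback.
    """
    window_start = recovery_time - 2 * delta_star
    safe = [tx for tx in txs if tx.finalized_at < window_start]
    at_risk = [tx for tx in txs if tx.finalized_at >= window_start]
    return safe, at_risk


txs = [FinalizedTx("a", 10), FinalizedTx("b", 85), FinalizedTx("c", 99)]
safe, at_risk = split_by_rollback_window(txs, recovery_time=100, delta_star=10)
print("safe:", [tx.tx_id for tx in safe])
print("at risk:", [tx.tx_id for tx in at_risk])
```

The window bounds the damage from above: a transaction inside it is not necessarily reverted, but nothing older than $2\Delta^*$ timesteps is touched.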

Problem

Research questions and friction points this paper is trying to address.

State Machine Replication

Fault Tolerance

Recovery under Delayed Attacks

Innovation

Methods, ideas, or system contributions that make the work stand out.

State Machine Replication

Error Recovery

Dynamic Error Limiting
