A Fault-Tolerant Version of Safra's Termination Detection Algorithm

📅 2026-01-30
🤖 AI Summary
This work addresses the inability of Safra's termination detection algorithm to tolerate node crashes in a distributed system. The paper proposes a decentralized fault-tolerant variant that splits the token's single counter into per-node counters, so that counts from crashed nodes can be discarded. When a node crashes, the token ring is restored locally and a backup token is sent; nodes inform each other of detected crashes via the token. The variant tolerates any number of crashes, including simultaneous ones, and adds no message overhead. Correctness proofs are given for both the original algorithm and the fault-tolerant variant, together with a model checking analysis.

📝 Abstract
Safra's distributed termination detection algorithm employs a logical token ring structure within a distributed network; only passive nodes forward the token, and a counter in the token keeps track of the number of sent minus the number of received messages. We adapt this classic algorithm to make it fault-tolerant. The counter is split into counters per node, to discard counts from crashed nodes. If a node crashes, the token ring is restored locally and a backup token is sent. Nodes inform each other of detected crashes via the token. Our algorithm imposes no additional message overhead, tolerates any number of crashes as well as simultaneous crashes, and copes with crashes in a decentralized fashion. Correctness proofs are provided of both the original Safra's algorithm and its fault-tolerant variant, as well as a model checking analysis.
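The token mechanism described in the abstract can be illustrated with a minimal sketch of the original, non-fault-tolerant algorithm. This is a simplified textbook-style variant, not the paper's formulation: it runs sequentially, assumes no basic messages are sent or received while the token travels, and treats the initiator like every other node. The names `Node`, `send`, `receive`, `token_round`, and `demo` are hypothetical. The node colors follow the usual presentation of Safra's algorithm (a node turns black when it receives a message), which the abstract itself does not mention.

```python
from dataclasses import dataclass

WHITE, BLACK = "white", "black"

@dataclass
class Node:
    active: bool = False
    counter: int = 0      # basic messages sent minus basic messages received
    color: str = WHITE    # turns black when a basic message is received

def send(nodes, src):
    """Active node src sends one basic message (it stays in flight)."""
    assert nodes[src].active, "only active nodes send basic messages"
    nodes[src].counter += 1

def receive(nodes, dst):
    """Node dst receives one basic message and becomes active and black."""
    nodes[dst].counter -= 1
    nodes[dst].color = BLACK
    nodes[dst].active = True

def token_round(nodes):
    """Pass the token once around the ring; True means termination is announced.

    Simplification: a real token would wait at the first active node; here
    the round is simply not started while any node is active.
    """
    if any(node.active for node in nodes):
        return False                   # only passive nodes forward the token
    token_count, token_color = 0, WHITE
    for node in nodes:                 # ring order 0, 1, ..., n-1
        token_count += node.counter    # accumulate sent minus received
        if node.color == BLACK:
            token_color = BLACK        # taint: this round's sum is not trusted
        node.color = WHITE             # a node whitens when it forwards the token
    return token_color == WHITE and token_count == 0

def demo():
    """Three nodes, one basic message from node 0 to node 2."""
    nodes = [Node() for _ in range(3)]
    results = []
    nodes[0].active = True
    send(nodes, 0)                     # message to node 2 is now in flight
    nodes[0].active = False
    results.append(token_round(nodes)) # count is 1: a message is still in flight
    receive(nodes, 2)
    nodes[2].active = False
    results.append(token_round(nodes)) # count is 0, but node 2 is black
    results.append(token_round(nodes)) # all white, count 0: termination
    return results
```

In this sequential setting the counter sum alone would suffice; the colors matter once messages can overtake the token in a truly concurrent run. The fault-tolerant variant in the paper changes the counter representation and ring maintenance, none of which is modeled here.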
Problem

Research questions and friction points this paper is trying to address.

fault-tolerant
termination detection
distributed algorithm
node crash
Safra's algorithm
Innovation

Methods, ideas, or system contributions that make the work stand out.

fault-tolerant
termination detection
distributed algorithm
token ring
model checking
Wan Fokkink
Professor of Computer Science, Vrije Universiteit Amsterdam
concurrency theory, formal methods, supervisory control, distributed algorithms
Georgios Karlos
Department of Computer Science, Paderborn University
Andy S. Tatman
Bernoulli Institute, University of Groningen