🤖 AI Summary
This paper investigates distributed computation in the congested clique model under adversarial node failures. Addressing the extreme setting where an adversary may fail up to 99% of the nodes during the computation, we propose the first general-purpose fault-tolerant transformation framework that converts any deterministic non-fault-tolerant algorithm into a low-overhead fault-tolerant variant, improving on the roughly linear-round baseline of learning the entire input. Our approach integrates deterministic algorithm reconstruction, message-sensitivity- and round-locality-driven complexity refinement, redundant input learning, and adaptive reconfiguration protocols. Theoretical contributions include: (i) semiring matrix multiplication in $O(n^{1/3} \operatorname{polylog} n)$ rounds; (ii) a solution to any task with $O(n \log n)$ bits of input per node in roughly $n$ rounds; and (iii) tolerance of a linear number, $\Theta(n)$, of faulty nodes.
📝 Abstract
We study a \textsf{Faulty Congested Clique} model, in which an adversary may fail nodes in the network throughout the computation. We show that any task of $O(n\log{n})$-bit input per node can be solved in roughly $n$ rounds, where $n$ is the size of the network. This nearly matches the linear upper bound on the complexity of the non-faulty clique model for such problems, by learning the entire input, and it holds in the faulty model even with a linear number of faults. Our main contribution is that we establish that one can do much better by looking more closely at the computation. Given a deterministic algorithm $\mathcal{A}$ for the non-faulty \textsf{Congested Clique} model, we show how to transform it into an algorithm $\mathcal{A}'$ for the faulty model, with an overhead that could be as small as some logarithmic-in-$n$ factor, by considering refined complexity measures of $\mathcal{A}$. As an exemplifying application of our approach, we show that the $O(n^{1/3})$-round complexity of semi-ring matrix multiplication [Censor-Hillel, Kaski, Korhonen, Lenzen, Paz, Suomela, PODC 2015] remains the same up to polylog factors in the faulty model, even if the adversary can fail $99\%$ of the nodes.
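
The linear upper bound mentioned above for the non-faulty model can be illustrated with a small simulation. The sketch below is not the paper's fault-tolerant algorithm; it only shows the standard baseline in which no node fails. It assumes each node holds exactly $n$ words of $O(\log n)$ bits and sends one word per link per round. The function name `learn_all_inputs` and the data layout are hypothetical, chosen only for illustration.

```python
def learn_all_inputs(inputs):
    """Simulate the non-faulty baseline of learning the entire input.

    inputs[i] is node i's input, assumed to be a list of exactly n words,
    where n = len(inputs). Returns (known, rounds), where known[j][i] is
    node j's copy of node i's input after the simulation.
    """
    n = len(inputs)
    # known[j][i][r] is word r of node i's input as received by node j
    known = [[[None] * n for _ in range(n)] for _ in range(n)]
    rounds = 0
    for r in range(n):
        # Round r: every node i sends its r-th word over each of its links,
        # so each link carries a single O(log n)-bit word in this round.
        for i in range(n):
            for j in range(n):
                known[j][i][r] = inputs[i][r]
        rounds += 1
    return known, rounds
```

Under these assumptions every node knows all $n^2$ words after exactly $n$ rounds, which is why any task on such inputs has a linear upper bound when no node fails. The abstract's point is that a comparable bound survives a linear number of faults, and that specific algorithms can do much better.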