🤖 AI Summary
In distributed approximate computing, servers independently straggle with probability $p$, causing estimation errors that fail to converge under conventional analyses.
Method: This work introduces the first theoretical analysis under a probabilistic straggler model, integrating Berrut-type rational interpolation, learning-theoretic modeling, rigorous probabilistic analysis, and numerical experiments—applicable to general continuous functions, including deep neural networks.
Contribution/Results: We prove that the average approximation errors of both the BACC and LeTCC coding schemes converge to zero as the number of servers $N$ increases, overturning prior analyses that required a fixed number of non-stragglers. Specifically, we derive convergence rates of $\mathcal{O}(\log^3_{1/p}(N) \cdot N^{-3})$ for BACC and $\mathcal{O}(\log^4_{1/p}(N) \cdot N^{-2})$ for LeTCC. Extensive experiments validate substantial improvements over baseline schemes across diverse tasks.
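A plausible source of the polylogarithmic factors in these rates is a standard fact about i.i.d. Bernoulli sequences: among $N$ servers that each straggle independently with probability $p$, the longest run of consecutive stragglers concentrates around $\log_{1/p} N$. The following is a quick illustrative simulation of that fact (not code from the paper); the seed, $p=0.2$, and server counts are arbitrary assumptions:

```python
import numpy as np

def longest_run(mask):
    # Length of the longest run of consecutive True entries.
    best = cur = 0
    for s in mask:
        cur = cur + 1 if s else 0
        best = max(best, cur)
    return best

rng = np.random.default_rng(1)
p = 0.2  # each server straggles independently with probability p
for N in (100, 1000, 10000):
    straggle = rng.random(N) < p          # i.i.d. straggler indicators
    predicted = np.log(N) / np.log(1 / p)  # ~ log_{1/p} N
    print(N, longest_run(straggle), round(predicted, 2))
```

The observed longest run stays close to the $\log_{1/p} N$ prediction as $N$ grows, which is consistent with the gap between adjacent non-stragglers driving the log factors in the error bounds.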
📝 Abstract
Coded computing has demonstrated promising results in addressing straggler resiliency in distributed computing systems. However, most coded computing schemes are designed for exact computation, requiring the number of responding servers to exceed a certain recovery threshold. Additionally, these schemes are tailored to highly structured functions. Recently, new coded computing schemes for general computing functions have emerged, in which exact computation is replaced with approximate computation. In these schemes, the availability of additional results corresponds to a more accurate estimation of the computational tasks. This flexibility raises new questions that need to be addressed. This paper addresses a practically important scenario in general coded computing, in which each server may become a straggler with probability $p$, independently of the others. We theoretically analyze the approximation error of two existing general coded computing schemes: Berrut Approximate Coded Computing (BACC) and Learning Theoretic Coded Computing (LeTCC). Under this probabilistic straggler configuration, we show that the average approximation errors of BACC and LeTCC converge to zero at rates of at least $\mathcal{O}(\log^3_{\frac{1}{p}}(N) \cdot N^{-3})$ and $\mathcal{O}(\log^4_{\frac{1}{p}}(N) \cdot N^{-2})$, respectively. This is perhaps surprising, as earlier results do not indicate convergence when the number of stragglers scales with the total number of servers $N$. In this case, although the average number of stragglers is $Np$, the independence of servers in becoming stragglers allows the approximation error to converge to zero. These theoretical results are validated through experiments on various computing functions, including deep neural networks.
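To make the BACC setting concrete, the sketch below applies Berrut's barycentric rational interpolant (the classical formula with alternating weights $(-1)^i$) to the evaluation points returned by the surviving servers, with each server straggling independently with probability $p$. This is an illustrative toy, not the paper's implementation: the target function `tanh`, Chebyshev node placement, $p=0.3$, and the mean-absolute-error metric are all assumptions chosen for the demo.

```python
import numpy as np

def berrut_interpolate(x_nodes, f_vals, x_eval):
    # Berrut's rational interpolant with alternating weights (-1)^i
    # over the surviving (non-straggler) nodes.
    w = (-1.0) ** np.arange(len(x_nodes))
    diff = x_eval[:, None] - x_nodes[None, :]
    exact = np.isclose(diff, 0.0)            # evaluation point hits a node
    terms = w / np.where(exact, 1.0, diff)
    r = (terms * f_vals).sum(axis=1) / terms.sum(axis=1)
    hit = exact.any(axis=1)
    if hit.any():                             # interpolant is exact at nodes
        r[hit] = f_vals[exact[hit].argmax(axis=1)]
    return r

rng = np.random.default_rng(0)
f = np.tanh   # stand-in for a general continuous computation
p = 0.3       # independent straggler probability
x_eval = np.linspace(-0.9, 0.9, 201)
for N in (32, 128, 512):
    nodes = np.cos((2 * np.arange(N) + 1) * np.pi / (2 * N))  # Chebyshev points
    alive = rng.random(N) > p                                  # surviving servers
    approx = berrut_interpolate(nodes[alive], f(nodes[alive]), x_eval)
    print(N, int(alive.sum()), np.abs(approx - f(x_eval)).mean())
```

Even though roughly $Np$ servers drop out, the surviving nodes remain well spread with high probability, and the decoded approximation error shrinks as $N$ grows, matching the qualitative behavior the abstract describes.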