Dual-Lagrange Encoding for Storage and Download in Elastic Computing for Resilience

📅 2025-01-28
📈 Citations: 1
Influential: 0
🤖 AI Summary
Existing approaches to large-scale matrix multiplication in elastic computing suffer from poor straggler tolerance, high data upload overhead, and excessive storage redundancy—requiring full local storage of input matrices. To address these challenges, this paper proposes a novel two-stage Lagrange coding scheme, the first to jointly optimize both storage and download efficiency in elastic environments. The method supports VM preemption, dynamic scaling, and heterogeneous node scheduling, reducing storage overhead to 1/L of Zhong et al.’s scheme and overcoming limitations of conventional full-storage or fault-intolerant designs. By integrating distributed matrix blocking with heterogeneous and cyclic task assignment strategies, our approach achieves load balancing, minimizes total computation time and computational redundancy, and significantly improves fault tolerance and resource utilization—as empirically validated on AWS EC2.

📝 Abstract
Coded elastic computing enables virtual machines to be preempted for high-priority tasks while allowing new virtual machines to join ongoing computation seamlessly. This paper addresses coded elastic computing for matrix-matrix multiplications with straggler tolerance by encoding both storage and download using Lagrange codes. In 2018, Yang et al. introduced the first coded elastic computing scheme for matrix-matrix multiplications, achieving a lower computational load requirement. However, this scheme lacks straggler tolerance and suffers from high upload cost. Zhong et al. (2023) later tackled these shortcomings by employing uncoded storage and Lagrange-coded download. However, their approach requires each machine to store the entire dataset. This paper introduces a new class of elastic computing schemes that utilize Lagrange codes to encode both storage and download, achieving a reduced storage size. The proposed schemes efficiently mitigate both elasticity and straggler effects, with the storage size reduced to a fraction $\frac{1}{L}$ of that required by Zhong et al.'s approach, at the expense of doubling the download cost. Moreover, we evaluate the proposed schemes on AWS EC2 by measuring computation time under two different task allocations: heterogeneous and cyclic assignments. Both assignments minimize the computation redundancy of the system while distributing varying computation loads across machines.
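The core mechanism behind such Lagrange-coded schemes can be illustrated in a toy sketch (this is an illustration of standard Lagrange coded computing, not the paper's exact construction; the block count `L`, the interpolation points `betas`, and the machine evaluation points `alphas` are assumed for the example). Each input matrix is split into L blocks, a degree-(L-1) polynomial is interpolated through the blocks, and each machine stores one evaluation of that polynomial. The product polynomial has degree 2(L-1), so any 2L-1 machine results suffice to recover every blockwise product, which is what gives straggler tolerance with only a 1/L-size coded share per machine:

```python
import numpy as np

def lagrange_eval(blocks, points, z):
    """Evaluate at z the Lagrange polynomial f with f(points[i]) = blocks[i].

    Interpolation is linear, so it applies entrywise to matrix blocks.
    """
    L = len(blocks)
    out = np.zeros_like(blocks[0], dtype=float)
    for i in range(L):
        coeff = 1.0
        for j in range(L):
            if j != i:
                coeff *= (z - points[j]) / (points[i] - points[j])
        out = out + coeff * blocks[i]
    return out

# Toy setup: split A and B into L = 2 blocks each (names are illustrative).
rng = np.random.default_rng(0)
A = [rng.standard_normal((2, 2)) for _ in range(2)]
B = [rng.standard_normal((2, 2)) for _ in range(2)]
betas = [0.0, 1.0]        # one interpolation point per block
alphas = [2.0, 3.0, 4.0]  # one evaluation point per machine (2L - 1 machines)

# Machine m stores only the coded shares f(alpha_m), g(alpha_m) -- each the
# size of one block -- and returns the product h(alpha_m) = f(alpha_m) @ g(alpha_m).
products = [lagrange_eval(A, betas, a) @ lagrange_eval(B, betas, a)
            for a in alphas]

# h = f * g has degree 2(L-1) = 2, so the 2L - 1 = 3 returned results
# determine h; interpolating h at beta_i recovers the block product A_i @ B_i.
recovered = lagrange_eval(products, alphas, betas[0])
print(np.allclose(recovered, A[0] @ B[0]))  # True
```

In a finite-field implementation the same interpolation runs over GF(q), and the 2L-1 threshold is exactly why any sufficiently large subset of non-preempted, non-straggling machines can complete the decode.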
Problem

Research questions and friction points this paper is trying to address.

Large Matrix Multiplication
Slow Machine Tolerance
Storage Efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dual Lagrange Encoding
Efficient Matrix Multiplication
Flexible Task Allocation
Xi Zhong
Department of Electrical and Computer Engineering, University of Florida, Gainesville, FL, USA
Samuel Lu
Rowland Hall St. Marks High School, Salt Lake City, UT, USA
Joerg Kliewer
Professor, New Jersey Institute of Technology
Privacy and Security, Distributed Computing, Machine Learning, Coding Theory, Information Theory
Mingyue Ji
University of Florida
Information Theory, Machine Learning, Communication and Networking, Optimization