Dual-Lagrange Encoding for Storage and Download in Elastic Computing for Resilience

📅 2025-01-28
📈 Citations: 1
Influential: 0
🤖 AI Summary
Existing approaches to large-scale matrix multiplication in elastic computing suffer from poor straggler tolerance, high data upload overhead, and excessive storage redundancy—requiring full local storage of input matrices. To address these challenges, this paper proposes a novel two-stage Lagrange coding scheme, the first to jointly optimize both storage and download efficiency in elastic environments. The method supports VM preemption, dynamic scaling, and heterogeneous node scheduling, reducing storage overhead to 1/L of Zhong et al.’s scheme and overcoming limitations of conventional full-storage or fault-intolerant designs. By integrating distributed matrix blocking with heterogeneous and cyclic task assignment strategies, our approach achieves load balancing, minimizes total computation time and computational redundancy, and significantly improves fault tolerance and resource utilization—as empirically validated on AWS EC2.

📝 Abstract
Coded elastic computing enables virtual machines to be preempted for high-priority tasks while allowing new virtual machines to join ongoing computation seamlessly. This paper addresses coded elastic computing for matrix-matrix multiplications with straggler tolerance by encoding both storage and download using Lagrange codes. In 2018, Yang et al. introduced the first coded elastic computing scheme for matrix-matrix multiplications, achieving a lower computational load requirement. However, this scheme lacks straggler tolerance and suffers from high upload cost. Zhong et al. (2023) later tackled these shortcomings by employing uncoded storage and Lagrange-coded download. However, their approach requires each machine to store the entire dataset. This paper introduces a new class of elastic computing schemes that utilize Lagrange codes to encode both storage and download, achieving a reduced storage size. The proposed schemes efficiently mitigate both elasticity and straggler effects, with the storage size reduced to a fraction $\frac{1}{L}$ of that required by Zhong et al.'s approach, at the expense of doubling the download cost. Moreover, we evaluate the proposed schemes on AWS EC2 by measuring computation time under two different task allocations: heterogeneous and cyclic assignments. Both assignments minimize the computation redundancy of the system while distributing varying computation loads across machines.
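The core mechanism behind such Lagrange-coded schemes can be illustrated in a toy sketch (this is an illustration of standard Lagrange coded computing, not the paper's exact construction; the block count `L`, the interpolation points `betas`, and the machine evaluation points `alphas` are assumed for the example). Each input matrix is split into L blocks, a degree-(L-1) polynomial is interpolated through the blocks, and each machine stores one evaluation of that polynomial. The product polynomial has degree 2(L-1), so any 2L-1 machine results suffice to recover every blockwise product, which is what gives straggler tolerance with only a 1/L-size coded share per machine:

```python
import numpy as np

def lagrange_eval(blocks, points, z):
    """Evaluate at z the Lagrange polynomial f with f(points[i]) = blocks[i].

    Interpolation is linear, so it applies entrywise to matrix blocks.
    """
    L = len(blocks)
    out = np.zeros_like(blocks[0], dtype=float)
    for i in range(L):
        coeff = 1.0
        for j in range(L):
            if j != i:
                coeff *= (z - points[j]) / (points[i] - points[j])
        out = out + coeff * blocks[i]
    return out

# Toy setup: split A and B into L = 2 blocks each (names are illustrative).
rng = np.random.default_rng(0)
A = [rng.standard_normal((2, 2)) for _ in range(2)]
B = [rng.standard_normal((2, 2)) for _ in range(2)]
betas = [0.0, 1.0]        # one interpolation point per block
alphas = [2.0, 3.0, 4.0]  # one evaluation point per machine (2L - 1 machines)

# Machine m stores only the coded shares f(alpha_m), g(alpha_m) -- each the
# size of one block -- and returns the product h(alpha_m) = f(alpha_m) @ g(alpha_m).
products = [lagrange_eval(A, betas, a) @ lagrange_eval(B, betas, a)
            for a in alphas]

# h = f * g has degree 2(L-1) = 2, so the 2L - 1 = 3 returned results
# determine h; interpolating h at beta_i recovers the block product A_i @ B_i.
recovered = lagrange_eval(products, alphas, betas[0])
print(np.allclose(recovered, A[0] @ B[0]))  # True
```

In a finite-field implementation the same interpolation runs over GF(q), and the 2L-1 threshold is exactly why any sufficiently large subset of non-preempted, non-straggling machines can complete the decode.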
Problem

Research questions and friction points this paper is trying to address.

Large Matrix Multiplication
Slow Machine Tolerance
Storage Efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dual Lagrange Encoding
Efficient Matrix Multiplication
Flexible Task Allocation
Xi Zhong
Department of Electrical and Computer Engineering, University of Florida, Gainesville, FL, USA
Samuel Lu
Rowland Hall St. Marks High School, Salt Lake City, UT, USA
Joerg Kliewer
Professor, New Jersey Institute of Technology
Privacy and Security, Distributed Computing, Machine Learning, Coding Theory, Information Theory
Mingyue Ji
University of Florida
Information Theory, Machine Learning, Communication and Networking, Optimization