Learning in Chaos: Efficient Autoscaling and Self-healing for Distributed Training at the Edge

📅 2025-05-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
Edge AI clusters suffer from frequent node and link dynamics that disrupt distributed training, while conventional checkpoint-based recovery and cloud-centric scaling mechanisms exhibit high latency and offer weak support for decentralized operation at the edge. This paper proposes Chaos, an edge-native distributed training system with built-in self-healing and autonomous elastic scaling. It introduces a multi-neighbor parallel state-retrieval mechanism coupled with dynamic shard-based scheduling, and pairs a topology-aware lightweight cluster monitor with a peer-to-peer negotiation protocol for fully decentralized scaling. Experiments show that Chaos achieves significantly lower scale-out latency than baselines such as Pollux, EDL, and Autoscaling, handles scale-in, connect-link, and disconnect-link events within 1 ms, and delivers the lowest idle time, indicating efficient resource use and scalability as the cluster grows.
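For intuition, here is a minimal Python sketch of what a lightweight, topology-aware cluster monitor of this kind could look like; the class and method names (ClusterMonitor, NodeView, on_heartbeat, alive_neighbors) are illustrative assumptions, not the paper's implementation.

```python
# Hypothetical sketch of a lightweight, topology-aware cluster monitor: each
# node tracks its neighbors from periodic heartbeats and exposes the fresh
# topology/bandwidth view to the shard scheduler. Names are illustrative.
import time
from dataclasses import dataclass


@dataclass
class NodeView:
    node_id: str
    last_heartbeat: float   # monotonic timestamp of the last heartbeat received
    bandwidth_mbps: float   # latest link-bandwidth estimate to this neighbor


class ClusterMonitor:
    def __init__(self, timeout_s: float = 2.0):
        self.timeout_s = timeout_s           # heartbeat expiry threshold
        self.neighbors: dict[str, NodeView] = {}

    def on_heartbeat(self, node_id: str, bandwidth_mbps: float) -> None:
        """Record a heartbeat and refresh the link-bandwidth estimate."""
        self.neighbors[node_id] = NodeView(node_id, time.monotonic(), bandwidth_mbps)

    def alive_neighbors(self) -> list[NodeView]:
        """Neighbors with fresh heartbeats, i.e. candidates for shard pulls."""
        now = time.monotonic()
        return [v for v in self.neighbors.values() if now - v.last_heartbeat <= self.timeout_s]

    def failed_neighbors(self) -> list[str]:
        """Node ids whose heartbeats expired; reported as leave/failure events."""
        now = time.monotonic()
        return [v.node_id for v in self.neighbors.values() if now - v.last_heartbeat > self.timeout_s]
```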

📝 Abstract
Frequent node and link changes in edge AI clusters disrupt distributed training, while traditional checkpoint-based recovery and cloud-centric autoscaling are too slow for scale-out and ill-suited to chaotic, self-governed edge environments. This paper proposes Chaos, a resilient and scalable edge distributed training system with built-in self-healing and autoscaling. It speeds up scale-out by using multi-neighbor replication with fast shard scheduling, allowing a new node to pull the latest training state from nearby neighbors in parallel while balancing the traffic load between them. It also uses a cluster monitor to track resource and topology changes to assist scheduler decisions, and handles scaling events through peer negotiation protocols, enabling fully self-governed autoscaling without a central administrator. Extensive experiments show that Chaos consistently achieves much lower scale-out delays than Pollux, EDL, and Autoscaling, and handles scale-in, connect-link, and disconnect-link events within 1 millisecond, so node joins, exits, and failures are handled smoothly. It also delivers the lowest idle time, showing superior resource use and scalability as the cluster grows.
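To make the scale-out path concrete, below is a minimal Python sketch of multi-neighbor parallel state retrieval with bandwidth-balanced shard scheduling. All names (Neighbor, assign_shards, fetch_shard, pull_state) are hypothetical illustrations of the idea, not Chaos's actual API.

```python
# Hypothetical sketch: a joining node splits the latest training state into
# shards, assigns each shard to a neighbor in proportion to link bandwidth,
# and pulls all shards in parallel. Names are illustrative, not Chaos's API.
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass
class Neighbor:
    node_id: str
    bandwidth_mbps: float  # measured link bandwidth from the joining node to this neighbor


def assign_shards(num_shards: int, neighbors: list[Neighbor]) -> dict[str, list[int]]:
    """Greedy load balancing: each shard goes to the neighbor with the lowest
    assigned-shards-to-bandwidth ratio, so faster links carry more traffic."""
    assignment: dict[str, list[int]] = {n.node_id: [] for n in neighbors}
    for shard in range(num_shards):
        best = min(neighbors, key=lambda n: len(assignment[n.node_id]) / n.bandwidth_mbps)
        assignment[best.node_id].append(shard)
    return assignment


def fetch_shard(node_id: str, shard_id: int) -> bytes:
    """Stand-in for the RPC that pulls one shard of the latest training state
    (a slice of model weights / optimizer state) from a neighbor."""
    return f"{node_id}:{shard_id}".encode()  # a real system would transfer tensors here


def pull_state(num_shards: int, neighbors: list[Neighbor]) -> dict[int, bytes]:
    """Pull all shards from multiple neighbors in parallel and reassemble them."""
    plan = assign_shards(num_shards, neighbors)
    shards: dict[int, bytes] = {}
    with ThreadPoolExecutor(max_workers=len(neighbors)) as pool:
        futures = {
            pool.submit(fetch_shard, node_id, s): s
            for node_id, shard_ids in plan.items()
            for s in shard_ids
        }
        for fut, s in futures.items():
            shards[s] = fut.result()
    return shards


# Example: the neighbor with 3x the bandwidth is assigned roughly 3x the shards.
neighbors = [Neighbor("node-A", 100.0), Neighbor("node-B", 300.0)]
print(assign_shards(8, neighbors))
state = pull_state(8, neighbors)
```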
Problem

Research questions and friction points this paper is trying to address.

Efficient autoscaling for distributed training in chaotic edge environments
Self-healing mechanisms to handle frequent node and link changes
Decentralized self-governed system without central administration
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-neighbor replication with fast shard scheduling
Cluster monitor for resource and topology tracking
Peer negotiation protocols for self-governed autoscaling (see the sketch after this list)
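The peer negotiation idea can be illustrated with a minimal sketch of a single join round; the message and class names (JoinProposal, Peer, propose_join, acknowledge) are assumptions for illustration, not the protocol defined in the paper.

```python
# Illustrative sketch of a decentralized negotiation round for a node-join
# event: the detecting peer drafts a shard plan and commits it only if every
# other peer acknowledges, with no central administrator involved.
from dataclasses import dataclass, field


@dataclass
class JoinProposal:
    joiner_id: str
    proposer_id: str
    shard_plan: dict[str, list[int]]  # node_id -> shards that node will serve to the joiner


@dataclass
class Peer:
    node_id: str
    shards: list[int]
    peers: list["Peer"] = field(default_factory=list)

    def propose_join(self, joiner_id: str) -> bool:
        """Draft a plan where each peer serves half of its shards to the joiner,
        then collect acknowledgements from all peers before committing."""
        plan = {p.node_id: p.shards[: max(1, len(p.shards) // 2)] for p in [self, *self.peers]}
        proposal = JoinProposal(joiner_id, self.node_id, plan)
        return all(p.acknowledge(proposal) for p in self.peers)

    def acknowledge(self, proposal: JoinProposal) -> bool:
        """Accept only if this peer actually holds the shards assigned to it."""
        assigned = proposal.shard_plan.get(self.node_id, [])
        return set(assigned).issubset(self.shards)


# Example: three peers admit "node-D" without any coordinator.
a = Peer("node-A", shards=[0, 1, 2])
b = Peer("node-B", shards=[3, 4, 5])
c = Peer("node-C", shards=[6, 7, 8])
a.peers, b.peers, c.peers = [b, c], [a, c], [a, b]
print(a.propose_join("node-D"))  # True: every peer acknowledged the plan
```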
👥 Authors
Wenjiao Feng
School of Information and Communication Engineering, University of Electronic Science and Technology of China (UESTC), Chengdu, 611731, China
Rongxing Xiao
School of Information and Communication Engineering, University of Electronic Science and Technology of China (UESTC), Chengdu, 611731, China
Zonghang Li
MBZUAI
Distributed ML, Edge AI, On-device LLM
Hongfang Yu
UESTC
Network Virtualization, Edge/Cloud Computing, Machine Learning Systems
Gang Sun
School of Information and Communication Engineering, University of Electronic Science and Technology of China (UESTC), Chengdu, 611731, China
Long Luo
University of Electronic Science and Technology of China (UESTC)
networks, distributed systems, algorithms
Mohsen Guizani
Department of Machine Learning, Mohamed bin Zayed University of Artificial Intelligence (MBZUAI), Building 1B, Masdar City, Abu Dhabi, United Arab Emirates
Qirong Ho
Mohamed bin Zayed University of Artificial Intelligence (MBZUAI) and Petuum, Inc.