Learning in Chaos: Efficient Autoscaling and Self-healing for Distributed Training at the Edge

📅 2025-05-19

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Edge AI clusters experience frequent node and link changes that disrupt distributed training, and conventional checkpoint-based recovery and cloud-centric autoscaling are too slow and poorly suited to decentralized edge environments. This paper proposes Chaos, an edge-native distributed training system with built-in self-healing and autoscaling. It combines multi-neighbor replication with fast shard scheduling, so that a joining node pulls the latest training state from nearby neighbors in parallel, with a cluster monitor that tracks resource and topology changes and peer negotiation protocols that handle scaling events without a central administrator. Experiments show that Chaos achieves much lower scale-out delays than Pollux, EDL, and Autoscaling, handles scale-in, connect-link, and disconnect-link events within 1 millisecond, and delivers the lowest idle time among the compared systems.

📝 Abstract

Frequent node and link changes in edge AI clusters disrupt distributed training, while traditional checkpoint-based recovery and cloud-centric autoscaling are too slow for scale-out and ill-suited to the chaotic, self-governed edge. This paper proposes Chaos, a resilient and scalable edge distributed training system with built-in self-healing and autoscaling. It speeds up scale-out by using multi-neighbor replication with fast shard scheduling, allowing a new node to pull the latest training state from nearby neighbors in parallel while balancing the traffic load among them. It also uses a cluster monitor to track resource and topology changes and inform scheduler decisions, and it handles scaling events through peer negotiation protocols, enabling fully self-governed autoscaling without a central admin. Extensive experiments show that Chaos consistently achieves much lower scale-out delays than Pollux, EDL, and Autoscaling, and handles scale-in, connect-link, and disconnect-link events within 1 millisecond, so node joins, exits, and failures are handled more smoothly. It also delivers the lowest idle time, showing superior resource use and scalability as the cluster grows.
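
The abstract does not spell out how shards are scheduled across neighbors, so the following is only an illustrative sketch of the general idea of a joining node pulling state from several neighbors in parallel with balanced traffic. It assumes that load is balanced by giving each neighbor a share of shards proportional to its link bandwidth (using largest-remainder rounding). The names `assign_shards`, `pull_state`, `fetch_shard`, and the bandwidth figures are hypothetical and do not come from the paper.

```python
from concurrent.futures import ThreadPoolExecutor


def assign_shards(num_shards, bandwidth_mbps):
    """Split shard indices across neighbors in proportion to link bandwidth.

    Assumption for illustration: bandwidth is the only balancing signal.
    """
    total = sum(bandwidth_mbps.values())
    if total <= 0:
        raise ValueError("at least one neighbor needs positive bandwidth")
    # Ideal fractional share of shards for each neighbor.
    quotas = {n: num_shards * bw / total for n, bw in bandwidth_mbps.items()}
    counts = {n: int(q) for n, q in quotas.items()}
    leftover = num_shards - sum(counts.values())
    # Largest-remainder rounding: the shards left over after flooring go to
    # the neighbors with the largest fractional parts.
    by_remainder = sorted(quotas, key=lambda n: quotas[n] - counts[n], reverse=True)
    for n in by_remainder[:leftover]:
        counts[n] += 1
    # Turn the per-neighbor counts into contiguous ranges of shard indices.
    plan, start = {}, 0
    for n, c in counts.items():
        plan[n] = list(range(start, start + c))
        start += c
    return plan


def pull_state(plan, fetch_shard):
    """Fetch every shard from its assigned neighbor, one worker per neighbor."""
    def pull_from(item):
        neighbor, shards = item
        return [(s, fetch_shard(neighbor, s)) for s in shards]

    with ThreadPoolExecutor(max_workers=max(1, len(plan))) as pool:
        chunks = list(pool.map(pull_from, plan.items()))
    return {shard: data for chunk in chunks for shard, data in chunk}
```

With two neighbors at 300 and 100 Mbps and 8 shards, `assign_shards(8, {"a": 300, "b": 100})` gives six shards to `a` and two to `b`. `pull_state` then retrieves both groups concurrently through whatever transport `fetch_shard` wraps.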

Problem

Research questions and friction points this paper is trying to address.

Efficient autoscaling for distributed training in chaotic edge environments

Self-healing mechanisms to handle frequent node and link changes

Decentralized self-governed system without central administration

Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-neighbor replication with fast shard scheduling

Cluster monitor for resource and topology tracking

Peer negotiation protocols for self-governed autoscaling
