🤖 AI Summary
Edge AI clusters experience frequent node and link changes that disrupt distributed training, while conventional checkpoint-based recovery and cloud-centric autoscaling are too slow for scale-out and poorly suited to decentralized edge environments. This paper proposes Chaos, an edge distributed training system with built-in self-healing and autoscaling. It combines multi-neighbor replication with fast shard scheduling, so a joining node pulls the latest training state from nearby neighbors in parallel while the traffic load is balanced among them. A cluster monitor tracks resource and topology changes to inform scheduler decisions, and peer negotiation protocols handle scaling events without a central admin. Experiments show that Chaos achieves much lower scale-out delays than Pollux, EDL, and Autoscaling, handles scale-in, connect-link, and disconnect-link events within 1 millisecond, and delivers the lowest idle time among the compared systems.
📝 Abstract
Frequent node and link changes in edge AI clusters disrupt distributed training, while traditional checkpoint-based recovery and cloud-centric autoscaling are too slow for scale-out and ill-suited to the chaotic, self-governed edge. This paper proposes Chaos, a resilient and scalable edge distributed training system with built-in self-healing and autoscaling. It speeds up scale-out by using multi-neighbor replication with fast shard scheduling, allowing a new node to pull the latest training state from nearby neighbors in parallel while balancing the traffic load among them. It also uses a cluster monitor to track resource and topology changes to assist scheduler decisions, and handles scaling events through peer negotiation protocols, enabling fully self-governed autoscaling without a central admin. Extensive experiments show that Chaos consistently achieves much lower scale-out delays than Pollux, EDL, and Autoscaling, and handles scale-in, connect-link, and disconnect-link events within 1 millisecond, so that node joins, exits, and failures are handled smoothly. It also delivers the lowest idle time, showing superior resource use and scalability as the cluster grows.
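The idea of pulling state shards from several neighbors while balancing their load can be illustrated with a minimal sketch. This is not the scheduler used by Chaos, whose details the abstract does not give; it is a generic greedy assignment that assumes equal-size shards and a known per-neighbor bandwidth. The function name `assign_shards` and the bandwidth map are hypothetical.

```python
import heapq


def assign_shards(num_shards, bandwidth):
    """Greedily assign equal-size shards to neighbors.

    Illustrative only (not the Chaos scheduler). `bandwidth` maps a
    neighbor name to its relative link bandwidth (hypothetical input).
    Each shard goes to the neighbor that would finish it earliest.
    """
    if not bandwidth:
        raise ValueError("at least one neighbor is required")
    if any(bw <= 0 for bw in bandwidth.values()):
        raise ValueError("bandwidth values must be positive")

    plan = {name: [] for name in bandwidth}
    # Heap entry: (finish time if this neighbor takes one more shard, name).
    heap = [(1.0 / bw, name) for name, bw in bandwidth.items()]
    heapq.heapify(heap)

    for shard in range(num_shards):
        _, name = heapq.heappop(heap)
        plan[name].append(shard)
        next_finish = (len(plan[name]) + 1) / bandwidth[name]
        heapq.heappush(heap, (next_finish, name))
    return plan


if __name__ == "__main__":
    # A neighbor with twice the bandwidth ends up serving twice the shards.
    print(assign_shards(6, {"a": 2.0, "b": 1.0}))
```

Under these assumptions, each neighbor serves a share of shards roughly proportional to its bandwidth, so the parallel transfers finish at about the same time.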