🤖 AI Summary
Collaborative machine learning faces challenges including non-scalable resource management, poor adaptability to heterogeneous edge devices, and lack of multi-tenancy support. To address these, this paper proposes a scalable edge–cloud collaborative training platform featuring a novel decoupled control plane and data plane architecture. The platform integrates distributed scheduling, multi-tenant isolation, edge-level flow control, model distribution, and result aggregation, along with an adaptive client-side module for heterogeneous devices. It enables fine-grained resource sharing policies and cross-task coordinated scheduling, achieving high resource reuse while preserving tenant isolation. Experimental evaluation demonstrates significant improvements over state-of-the-art frameworks: 1.88× higher resource utilization, up to 2.76× higher throughput, and 1.26× reduction in task completion time.
📝 Abstract
Collaborative Machine Learning is a paradigm in the field of distributed machine learning, designed to address the challenges of data privacy, communication overhead, and model heterogeneity. There have been significant advancements in optimization and communication algorithm design and ML hardware that enables fair, efficient and secure collaborative ML training. However, less emphasis is put on collaborative ML infrastructure development. Developers and researchers often build server-client systems for a specific collaborative ML use case, which is not scalable and reusable. As the scale of collaborative ML grows, the need for a scalable, efficient, and ideally multi-tenant resource management system becomes more pressing. We propose a novel system, Propius, that can adapt to the heterogeneity of client machines, and efficiently manage and control the computation flow between ML jobs and edge resources in a scalable fashion. Propius is comprised of a control plane and a data plane. The control plane enables efficient resource sharing among multiple collaborative ML jobs and supports various resource sharing policies, while the data plane improves the scalability of collaborative ML model sharing and result collection. Evaluations show that Propius outperforms existing resource management techniques and frameworks in terms of resource utilization (up to $1.88 imes$), throughput (up to $2.76$), and job completion time (up to $1.26 imes$).