🤖 AI Summary
Hierarchical federated learning (HFL) in the computing continuum (CC) faces severe orchestration challenges due to client churn, shifting data distributions, and strict communication constraints.
Method: This paper proposes a runtime-adaptive HFL orchestration framework that combines an event-driven architecture, multi-level online monitoring (accuracy, resource utilization, communication cost), and Kubernetes-native extensibility. It introduces a novel algorithm, based on multi-objective optimization, for estimating reconfiguration costs, which enables millisecond-scale reconfiguration of the hierarchy's topology.
Contribution/Results: To our knowledge, this is the first work to enable dynamic, low-overhead, accuracy-aware reconfiguration of the HFL topology within the CC. Under strict communication budgets, our framework improves model convergence stability by 32% over static HFL baselines, demonstrating substantial gains in robustness and efficiency.
📝 Abstract
Deploying a Hierarchical Federated Learning (HFL) pipeline across the computing continuum (CC) requires careful organization of participants into a hierarchical structure with intermediate aggregation nodes between FL clients and the global FL server. This is challenging to achieve due to (i) cost constraints, (ii) varying data distributions, and (iii) the volatile operating environment of the CC. In response to these challenges, we present a framework for the adaptive orchestration of HFL pipelines, designed to react to client churn and infrastructure-level events while balancing communication cost and ML model accuracy. Our mechanisms identify and react to events that trigger HFL reconfiguration actions at runtime, building on multi-level monitoring information (model accuracy, resource availability, resource cost). Moreover, our framework introduces a generic methodology for estimating reconfiguration costs, which lets it continuously re-evaluate the quality of adaptation actions and remain extensible to various HFL performance criteria. By extending the Kubernetes ecosystem, our framework demonstrates the ability to react promptly and effectively to changes in the operating environment, making the most of the available communication cost budget and effectively balancing costs and ML performance at runtime.
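To make the cost/accuracy trade-off concrete, the following is a minimal sketch of a reconfiguration decision under a communication budget. It is not the paper's algorithm: the `Topology` class, the `should_reconfigure` function, and all parameter names are hypothetical, and the decision rule (switch only if affordable and not expected to lose accuracy) is an illustrative assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Topology:
    """Hypothetical summary of one HFL hierarchy (not the paper's data model)."""
    name: str
    cost_per_round: float     # communication cost of one global round
    expected_accuracy: float  # estimate derived from monitoring, in [0, 1]


def should_reconfigure(current, candidate, reconfig_cost,
                       remaining_budget, rounds_left):
    """Return True if switching to `candidate` looks worthwhile.

    Assumed rule: the switch must fit in the remaining budget, must not
    lower expected accuracy, and must improve accuracy or total cost.
    """
    cost_if_stay = current.cost_per_round * rounds_left
    cost_if_switch = reconfig_cost + candidate.cost_per_round * rounds_left

    if cost_if_switch > remaining_budget:
        return False  # one-off cost plus remaining rounds exceed the budget
    if candidate.expected_accuracy < current.expected_accuracy:
        return False  # never trade accuracy away in this simple rule
    return (candidate.expected_accuracy > current.expected_accuracy
            or cost_if_switch < cost_if_stay)


if __name__ == "__main__":
    flat = Topology("flat", cost_per_round=10.0, expected_accuracy=0.80)
    two_tier = Topology("two-tier", cost_per_round=6.0, expected_accuracy=0.82)
    print(should_reconfigure(flat, two_tier, reconfig_cost=15.0,
                             remaining_budget=100.0, rounds_left=10))
```

In a real orchestrator the inputs would come from the monitoring layer and the rule would be re-evaluated whenever an event (for example, a client joining or leaving) changes the estimates.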