🤖 AI Summary
To address the resource contention, performance degradation, increased latency, and poor energy efficiency that arise from suboptimal hardware allocation in distributed systems, this paper proposes an online recommendation framework for dynamic hardware selection. It introduces contextual multi-armed bandits (contextual MAB) to hardware selection for the first time, which allows continual online learning without offline training and a principled exploration-exploitation trade-off, in contrast to conventional data-intensive approaches. The framework models real-time performance feedback and is designed to integrate with the National Data Platform (NDP) for low-friction deployment. Evaluated on three realistic workloads (Cycles, BurnPro3D, and matrix multiplication), the framework improves resource utilization, reduces end-to-end latency by 27.4% on average, and mitigates priority inversion and system instability.
📝 Abstract
Distributed computing systems are essential for meeting the demands of modern applications, yet transitioning from single-system to distributed environments presents significant challenges. Misallocating resources in shared systems can lead to resource contention, system instability, degraded performance, priority inversion, inefficient utilization, increased latency, and environmental impact. We present BanditWare, an online recommendation system that dynamically selects the most suitable hardware for applications using a contextual multi-armed bandit algorithm. BanditWare balances exploration and exploitation, gradually refining its hardware recommendations based on observed application performance while continuing to explore potentially better options. Unlike traditional statistical and machine learning approaches that rely heavily on large historical datasets, BanditWare operates online, learning and adapting in real time as new workloads arrive. We evaluated BanditWare on three workflow applications: Cycles (an agricultural science workflow), BurnPro3D (a web-based platform for fire science), and a matrix multiplication application. Designed for seamless integration with the National Data Platform (NDP), BanditWare enables users of all experience levels to optimize resource allocation efficiently.
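
The abstract does not give the details of the bandit algorithm, so the following is only a minimal sketch of how a contextual multi-armed bandit could recommend hardware online. It assumes an epsilon-greedy policy and one linear runtime model per hardware option, updated with a normalized least-mean-squares step after every run. The class name `EpsilonGreedyHardwareBandit`, the hardware labels, the single "problem size" feature, and the simulated runtimes are all hypothetical and are not taken from BanditWare.

```python
import random


class EpsilonGreedyHardwareBandit:
    """Hypothetical sketch: one online linear runtime model per hardware option."""

    def __init__(self, hardware, n_features, epsilon=0.1, lr=0.5, seed=None):
        self.hardware = list(hardware)
        self.epsilon = epsilon  # probability of exploring a random option
        self.lr = lr            # step size of the normalized LMS update
        self.rng = random.Random(seed)
        # n_features coefficients followed by one bias term, per hardware option
        self.weights = {h: [0.0] * (n_features + 1) for h in self.hardware}

    def predict(self, h, context):
        """Predicted runtime of a workload with the given features on hardware h."""
        w = self.weights[h]
        # zip stops at the shorter sequence, so the trailing bias is added separately
        return sum(wi * xi for wi, xi in zip(w, context)) + w[-1]

    def recommend(self, context):
        """Explore with probability epsilon, otherwise pick the lowest predicted runtime."""
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.hardware)
        return min(self.hardware, key=lambda h: self.predict(h, context))

    def update(self, h, context, observed_runtime):
        """Refine the model of hardware h using one observed runtime."""
        error = observed_runtime - self.predict(h, context)
        x = list(context) + [1.0]           # append the bias input
        norm = sum(xi * xi for xi in x)     # always >= 1 because of the bias input
        step = self.lr * error / norm
        w = self.weights[h]
        for i, xi in enumerate(x):
            w[i] += step * xi


if __name__ == "__main__":
    # Simulated (made-up) runtimes: "small" wins on tiny inputs, "large" on big ones.
    true_runtime = {"small": lambda s: 10 * s + 1, "large": lambda s: 4 * s + 2}
    bandit = EpsilonGreedyHardwareBandit(["small", "large"], n_features=1,
                                         epsilon=0.2, seed=0)
    workload = random.Random(1)
    for _ in range(2000):
        ctx = [workload.random()]           # normalized problem size in [0, 1)
        choice = bandit.recommend(ctx)
        bandit.update(choice, ctx, true_runtime[choice](ctx[0]))
    print(bandit.recommend([0.9]), bandit.recommend([0.02]))
```

Because every model starts by predicting a runtime of zero, untried hardware looks attractive at first, so each option is sampled early even without exploration. After that, the `epsilon` parameter controls how often the recommender keeps testing options that currently look worse, which is the exploration-exploitation balance the abstract describes.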