🤖 AI Summary
This work addresses the challenge of balancing job performance and system power constraints in high-performance computing (HPC) scheduling, where automatically selecting the optimal number of compute nodes is critical. The authors propose a novel surrogate model that, for the first time, integrates an attention mechanism into a multi-objective Bayesian optimization framework. By leveraging job telemetry data, the model captures the complex relationship between runtime and power consumption, while an intelligent sampling strategy enhances data efficiency. Evaluated on two real-world HPC datasets, the approach significantly outperforms baseline methods: it efficiently generates high-quality Pareto fronts for the performance–power trade-off, substantially reduces training costs, and improves optimization stability.
📝 Abstract
High-Performance Computing (HPC) schedulers must balance user performance with facility-wide resource constraints. The task reduces to selecting the optimal number of nodes for a given job. We present a surrogate-assisted multi-objective Bayesian optimization (MOBO) framework to automate this complex decision. Our core hypothesis is that surrogate models informed by attention-based embeddings of job telemetry can capture performance dynamics more effectively than standard regression techniques. We pair this with an intelligent sample acquisition strategy to ensure the approach is data-efficient. On two production HPC datasets, our embedding-informed method consistently identified higher-quality Pareto fronts of runtime–power trade-offs compared to baselines. Furthermore, our intelligent data sampling strategy drastically reduced training costs while improving the stability of the results. To our knowledge, this is the first work to successfully apply embedding-informed surrogates in a MOBO framework to the HPC scheduling problem, jointly optimizing for performance and power on production workloads.
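To make the runtime–power trade-off concrete, here is a minimal, generic sketch (not from the paper) of the two multi-objective building blocks the abstract relies on: extracting a Pareto front from candidate node-count configurations, and scoring a front with the standard 2D hypervolume indicator (one common way to compare Pareto-front quality; the paper may use a different metric). The candidate points and reference point are hypothetical.

```python
def pareto_front(points):
    """Return the non-dominated subset of (runtime, power) pairs, both minimized.

    A point p is dominated if some other point q is no worse in both
    objectives and strictly better in at least one.
    """
    front = [
        p for p in points
        if not any(q[0] <= p[0] and q[1] <= p[1] and q != p for q in points)
    ]
    return sorted(front)


def hypervolume_2d(front, ref):
    """Area dominated by a 2D minimization front, relative to reference point `ref`.

    Assumes `front` is non-dominated; sorting by the first objective then
    makes the second objective strictly decreasing, so the dominated region
    decomposes into disjoint rectangles swept left to right.
    """
    pts = sorted(front)
    hv = 0.0
    for i, (f1, f2) in enumerate(pts):
        next_f1 = pts[i + 1][0] if i + 1 < len(pts) else ref[0]
        hv += (next_f1 - f1) * (ref[1] - f2)
    return hv


# Hypothetical (runtime, power) outcomes for five node-count choices.
candidates = [(1, 5), (2, 4), (3, 3), (2, 6), (4, 4)]
front = pareto_front(candidates)        # (2, 6) and (4, 4) are dominated
print(front)                            # [(1, 5), (2, 4), (3, 3)]
print(hypervolume_2d(front, (5, 7)))    # 13.0
```

In a full MOBO loop, a surrogate model (here, attention-embedding-informed) would predict runtime and power for untried node counts, and an acquisition function would pick the next configuration expected to grow this hypervolume the most.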