🤖 AI Summary
This work addresses the challenge of balancing job performance and system power constraints in high-performance computing (HPC) scheduling, where automatically selecting the optimal number of compute nodes is critical. The authors propose a novel surrogate model that, for the first time, integrates an attention mechanism into a multi-objective Bayesian optimization framework. By leveraging job telemetry data, the model captures the complex relationship between runtime and power consumption, while an intelligent sampling strategy enhances data efficiency. Evaluated on two real-world HPC datasets, the approach significantly outperforms baseline methods: it efficiently generates high-quality Pareto fronts for the performance–power trade-off, substantially reduces training costs, and improves optimization stability.
📝 Abstract
High-Performance Computing (HPC) schedulers must balance user performance with facility-wide resource constraints. The task reduces to selecting the optimal number of nodes for a given job. We present a surrogate-assisted multi-objective Bayesian optimization (MOBO) framework to automate this complex decision. Our core hypothesis is that surrogate models informed by attention-based embeddings of job telemetry can capture performance dynamics more effectively than standard regression techniques. We pair this with an intelligent sample acquisition strategy to ensure the approach is data-efficient. On two production HPC datasets, our embedding-informed method consistently identified higher-quality Pareto fronts of runtime–power trade-offs compared to baselines. Furthermore, our intelligent data sampling strategy drastically reduced training costs while improving the stability of the results. To our knowledge, this is the first work to successfully apply embedding-informed surrogates in a MOBO framework to the HPC scheduling problem, jointly optimizing for performance and power on production workloads.
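To make the runtime–power trade-off concrete, here is a minimal, generic sketch (not from the paper) of the two multi-objective building blocks the abstract relies on: extracting a Pareto front from candidate node-count configurations, and scoring a front with the standard 2D hypervolume indicator (one common way to compare Pareto-front quality; the paper may use a different metric). The candidate points and reference point are hypothetical.

```python
def pareto_front(points):
    """Return the non-dominated subset of (runtime, power) pairs, both minimized.

    A point p is dominated if some other point q is no worse in both
    objectives and strictly better in at least one.
    """
    front = [
        p for p in points
        if not any(q[0] <= p[0] and q[1] <= p[1] and q != p for q in points)
    ]
    return sorted(front)


def hypervolume_2d(front, ref):
    """Area dominated by a 2D minimization front, relative to reference point `ref`.

    Assumes `front` is non-dominated; sorting by the first objective then
    makes the second objective strictly decreasing, so the dominated region
    decomposes into disjoint rectangles swept left to right.
    """
    pts = sorted(front)
    hv = 0.0
    for i, (f1, f2) in enumerate(pts):
        next_f1 = pts[i + 1][0] if i + 1 < len(pts) else ref[0]
        hv += (next_f1 - f1) * (ref[1] - f2)
    return hv


# Hypothetical (runtime, power) outcomes for five node-count choices.
candidates = [(1, 5), (2, 4), (3, 3), (2, 6), (4, 4)]
front = pareto_front(candidates)        # (2, 6) and (4, 4) are dominated
print(front)                            # [(1, 5), (2, 4), (3, 3)]
print(hypervolume_2d(front, (5, 7)))    # 13.0
```

In a full MOBO loop, a surrogate model (here, attention-embedding-informed) would predict runtime and power for untried node counts, and an acquisition function would pick the next configuration expected to grow this hypervolume the most.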