Attention-Informed Surrogates for Navigating Power-Performance Trade-offs in HPC

📅 2026-01-21
📈 Citations: 0 (Influential: 0)
🤖 AI Summary
This work addresses the challenge of balancing job performance and system power constraints in high-performance computing (HPC) scheduling, where automatically selecting the optimal number of compute nodes is critical. The authors propose a novel surrogate model that, for the first time, integrates an attention mechanism into a multi-objective Bayesian optimization framework. By leveraging job telemetry data, the model captures the complex relationship between runtime and power consumption, while an intelligent sampling strategy enhances data efficiency. Evaluated on two real-world HPC datasets, the approach significantly outperforms baseline methods: it efficiently generates high-quality Pareto fronts for the performance–power trade-off, substantially reduces training costs, and improves optimization stability.

📝 Abstract
High-Performance Computing (HPC) schedulers must balance user performance with facility-wide resource constraints. The task boils down to selecting the optimal number of nodes for a given job. We present a surrogate-assisted multi-objective Bayesian optimization (MOBO) framework to automate this complex decision. Our core hypothesis is that surrogate models informed by attention-based embeddings of job telemetry can capture performance dynamics more effectively than standard regression techniques. We pair this with an intelligent sample acquisition strategy to ensure the approach is data-efficient. On two production HPC datasets, our embedding-informed method consistently identified higher-quality Pareto fronts of runtime-power trade-offs compared to baselines. Furthermore, our intelligent data sampling strategy drastically reduced training costs while improving the stability of the results. To our knowledge, this is the first work to successfully apply embedding-informed surrogates in a MOBO framework to the HPC scheduling problem, jointly optimizing for performance and power on production workloads.
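
As a rough illustration of what a "Pareto front of runtime-power trade-offs" means for node selection, the sketch below filters a handful of candidate node counts down to the non-dominated ones. It is not the paper's method: the abstract describes surrogate predictions inside a MOBO loop, whereas this is a plain brute-force filter. The `pareto_front` helper and all numbers are hypothetical and chosen only for illustration.

```python
from typing import List, Tuple

# Hypothetical candidate: (node_count, runtime_seconds, avg_power_watts).
# These values are made up for illustration; they are not from the paper.
Candidate = Tuple[int, float, float]


def pareto_front(candidates: List[Candidate]) -> List[Candidate]:
    """Return candidates that no other candidate dominates.

    Both objectives (runtime and power) are minimized. A candidate is
    dominated if another one is no worse in both objectives and strictly
    better in at least one.
    """
    front = []
    for c in candidates:
        dominated = any(
            o[1] <= c[1] and o[2] <= c[2] and (o[1] < c[1] or o[2] < c[2])
            for o in candidates
        )
        if not dominated:
            front.append(c)
    # Sort by runtime so the trade-off curve reads fastest to slowest.
    return sorted(front, key=lambda c: c[1])


candidates = [
    (1, 1000.0, 300.0),
    (2, 560.0, 610.0),
    (3, 600.0, 900.0),   # slower and more power than 2 nodes: dominated
    (4, 330.0, 1250.0),
    (8, 340.0, 2500.0),  # slower and more power than 4 nodes: dominated
]

front = pareto_front(candidates)
for nodes, runtime, power in front:
    print(f"{nodes} nodes: {runtime:.0f} s, {power:.0f} W")
```

In this toy data the 3-node and 8-node allocations drop out, leaving the 4-, 2-, and 1-node options as the trade-off curve a scheduler could choose from under a given power cap.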

Problem

Research questions and friction points this paper is trying to address.

HPC scheduling
power-performance trade-off
node allocation
multi-objective optimization
job telemetry

Innovation

Methods, ideas, or system contributions that make the work stand out.

attention-based embeddings
surrogate-assisted MOBO
power-performance trade-off
HPC scheduling
intelligent sampling
Ashna Nawar Ahmed
Texas State University
Banooqa H. Banday
Texas State University
Terry Jones
Computer Scientist, Oak Ridge National Laboratory
High Performance Computing, System Software, Operating and Runtime Systems
Tanzima Z. Islam
Texas State University