COMPASS: A Unified Decision-Intelligence System for Navigating Performance Trade-off in HPC

📅 2026-04-24

🤖 AI Summary

This work addresses the challenge of configuring high-performance computing (HPC) systems, where existing tools such as autotuners recommend good configurations but do not identify the minimal changes a near-miss configuration needs to meet a performance objective, and often ignore domain-specific constraints. The study formulates configuration tuning as a queryable decision-intelligence task and presents an interactive decision engine built on operational trace data. By combining machine learning and uncertainty quantification with guidance on which configurations to run next, the approach supports minimal-change recommendations, constraint-aware tuning, and confidence assessment. When integrated with an open-source HPC scheduling simulator, it cuts average job turnaround time by 65.93% and node usage by 80.93% relative to the state of the art. It also achieves up to 100× faster training and 80× faster inference than state-of-the-art generative methods, and scales to traces with 1.3 billion samples (126 GB).

📝 Abstract

HPC systems expose many configuration parameters that jointly drive competing objectives. Existing tools such as autotuners recommend good configurations but do not identify the minimal changes a near-miss configuration needs to meet a performance objective, and they often ignore domain-specific constraints. To address this gap, we introduce COMPASS -- a modular, programmable engine that uses operational traces to generate HPC configuration recommendations and guide tuning decisions. This paper: (1) formalizes configuration questions into query patterns; (2) develops an interactive decision-making engine that formulates these queries as Machine Learning (ML) tasks; and (3) establishes the trustworthiness of its recommendations by providing evidence and quantifying uncertainty, and -- when confidence is low -- provides guidance on which configurations to run next. We validate COMPASS using analytical ground truth, reconstruction accuracy, reproduction of published findings, and, when possible, runs on real hardware. When integrated with an open-source HPC scheduling simulator, COMPASS cuts average job turnaround time by 65.93% and node usage by 80.93% relative to the state of the art. Moreover, COMPASS achieves up to 100x faster training and 80x faster inference than state-of-the-art generative methods, and scales to traces with 1.3B samples and 126 GB of data.
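
To make the "minimal change for a near-miss configuration" query concrete, here is a small illustrative sketch. It is not COMPASS's method: the abstract says such queries are formulated as ML tasks over operational traces, whereas this sketch swaps in a brute-force search over a tiny discrete space with a hand-written predictor. The names `minimal_change`, `toy_runtime`, `domains`, and `allowed`, as well as the parameters `nodes`, `threads`, and `io_mode`, are all hypothetical and chosen only for illustration.

```python
from itertools import combinations, product


def minimal_change(config, domains, predict, target, allowed=lambda c: True):
    """Find the config with the fewest changed parameters that meets the target.

    Tries 0 changes, then 1, then 2, ... and stops at the first size that
    yields a candidate with predict(candidate) <= target and allowed(candidate).
    Among candidates of that size, the lowest predicted value wins.
    """
    params = sorted(domains)
    for k in range(len(params) + 1):
        best = None
        for subset in combinations(params, k):
            # Alternative values for each parameter we are allowed to change.
            alternatives = [[v for v in domains[p] if v != config[p]] for p in subset]
            for values in product(*alternatives):
                candidate = dict(config)
                candidate.update(zip(subset, values))
                if not allowed(candidate):  # domain-specific constraint
                    continue
                score = predict(candidate)
                if score <= target and (best is None or score < best[0]):
                    best = (score, candidate, subset)
        if best is not None:
            return best[1], list(best[2])
    return None, []  # no feasible configuration meets the target


def toy_runtime(c):
    """Hypothetical stand-in for a learned runtime model (arbitrary units)."""
    io_penalty = 1.5 if c["io_mode"] == "posix" else 1.0
    return 960.0 / (c["nodes"] * c["threads"]) * io_penalty


# Hypothetical near-miss configuration and search space.
current = {"nodes": 4, "threads": 4, "io_mode": "posix"}
domains = {
    "nodes": [2, 4, 8, 16],
    "threads": [2, 4],
    "io_mode": ["posix", "collective"],
}

new_config, changed = minimal_change(
    current, domains, toy_runtime, target=50.0,
    allowed=lambda c: c["nodes"] <= 8,  # e.g. an allocation limit
)
print(changed, new_config, toy_runtime(new_config))
```

With these toy numbers the current configuration has a predicted runtime of 90.0, and the search changes only `nodes` (4 to 8) to reach 45.0. The 16-node option would be faster but is excluded by the constraint, which is the kind of domain-specific restriction the abstract says autotuners often ignore. Exhaustive enumeration like this grows combinatorially, so it is only workable for very small spaces.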

Problem

Research questions and friction points this paper is trying to address.

HPC configuration

performance trade-off

autotuning

domain-specific constraints

decision intelligence

Innovation

Methods, ideas, or system contributions that make the work stand out.

decision intelligence

configuration tuning

machine learning for HPC

uncertainty quantification

performance trade-off
