🤖 AI Summary
Assessing the suitability of public clouds (Azure, AWS, GCP) versus on-premises HPC clusters for scientific workflows—specifically regarding portability, reproducibility, dynamism, and automation—remains challenging due to fragmented evaluation criteria.
Method: We conduct a cross-platform empirical study across six environments (CPU/GPU configurations), scaling up to 28,672 CPU cores or 256 GPUs, using 11 representative HPC proxy applications and benchmarks (e.g., HPL, HPCG, IO500). Our methodology integrates containerization, multi-cloud orchestration, and observability-driven monitoring.
Contribution/Results: We introduce the first HPC Cloud Readiness Assessment Framework, quantifying platform disparities in latency, strong/weak scaling efficiency, I/O throughput, and cost-effectiveness. The framework establishes a standardized methodology for cloud migration evaluation, delivering actionable insights and decision support for cloud-native HPC adoption.
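For reference, the scaling-efficiency metrics named above follow the usual conventions (the summary does not spell them out, so these definitions are an assumption): with T(N) the runtime on N nodes at fixed total problem size (strong scaling) or fixed per-node problem size (weak scaling),

    E_strong(N) = T(1) / (N · T(N)),    E_weak(N) = T(1) / T(N)

Both efficiencies equal 1 under ideal scaling and decrease as communication and synchronization overheads grow.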
📝 Abstract
The rise of AI and the economic dominance of cloud computing have created a new nexus of innovation for high performance computing (HPC), which has a long history of driving scientific discovery. In addition to performance needs, scientific workflows increasingly demand capabilities of cloud environments: portability, reproducibility, dynamism, and automation. As converged cloud environments emerge, there is a growing need to study their fit for HPC use cases. Here we present a cross-platform usability study that assesses 11 different HPC proxy applications and benchmarks across three clouds (Microsoft Azure, Amazon Web Services, and Google Cloud), six environments, and two compute configurations (CPU and GPU) against on-premises HPC clusters at a major center. We perform scaling tests of applications in all environments up to 28,672 CPU cores and 256 GPUs. We present methodology and results to guide future study and provide a foundation to define best practices for running HPC workloads in the cloud.