AI Summary
This work addresses the challenge of efficiently tuning parallel file systems in high-performance computing environments, where complex I/O paths, diverse access patterns, and dynamically changing system states hinder optimal performance. To this end, we propose CARAT, a lightweight and scalable framework that enables client-side, online, self-adaptive tuning without relying on global system information or predefined I/O patterns. CARAT leverages locally observable metrics and employs a machine learning-guided adaptive algorithm to jointly optimize RPC and caching parameters on Lustre clients, dynamically responding to variations in application I/O behavior and system conditions. Evaluated under diverse dynamic I/O workloads and real-world HPC applications, CARAT achieves up to a 3x performance improvement over default or static configurations, demonstrating its effectiveness and robustness.
Abstract
Tuning parallel file systems (PFS) in High-Performance Computing (HPC) systems remains challenging due to complex I/O paths, diverse I/O patterns, and dynamic system conditions. While existing autotuning frameworks have shown promising results in tuning PFS parameters based on applications' I/O patterns, they lack scalability, adaptivity, and the ability to operate online. In this work, focusing on scalable online tuning, we present CARAT, an ML-guided framework that co-tunes the client-side RPC and caching parameters of a PFS using only locally observable metrics. Unlike global or pattern-dependent approaches, CARAT enables each client to make independent and intelligent tuning decisions online, responding to real-time changes in both application I/O behavior and system state. We prototyped CARAT on Lustre and evaluated it extensively across dynamic I/O patterns, real-world HPC workloads, and multi-client deployments. The results show that CARAT can achieve up to 3x performance improvement over the default or static configurations, validating the effectiveness and generality of our approach. Because CARAT is scalable and lightweight, we believe it has the potential to be widely deployed in existing PFS installations and to benefit a variety of data-intensive applications.
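To make the idea of independent client-side tuning from local metrics concrete, the following is a minimal sketch and not CARAT's actual algorithm. It replaces the ML-guided policy with plain greedy hill climbing over a small grid of two Lustre client tunables. `max_rpcs_in_flight` and `max_dirty_mb` are used here only as examples of an RPC parameter and a caching parameter; the abstract does not say which parameters CARAT tunes. The candidate values, the `tune` and `neighbors` helpers, and the synthetic `toy_throughput` function are all illustrative assumptions.

```python
from typing import Callable, Dict, List, Tuple

Config = Dict[str, int]

# Candidate values for two Lustre client tunables (illustrative grids, not from the paper).
CANDIDATES: Dict[str, List[int]] = {
    "max_rpcs_in_flight": [8, 16, 32, 64],
    "max_dirty_mb": [32, 128, 512],
}


def neighbors(cfg: Config) -> List[Config]:
    """Return configs that differ from cfg by one grid step in exactly one parameter."""
    result = []
    for name, values in CANDIDATES.items():
        i = values.index(cfg[name])
        for j in (i - 1, i + 1):
            if 0 <= j < len(values):
                result.append({**cfg, name: values[j]})
    return result


def tune(measure: Callable[[Config], float], start: Config,
         max_steps: int = 20) -> Tuple[Config, float]:
    """Greedy hill climbing: move to the best neighbor while the local metric improves."""
    best, best_score = start, measure(start)
    for _ in range(max_steps):
        scored = [(measure(c), c) for c in neighbors(best)]
        top_score, top_cfg = max(scored, key=lambda t: t[0])
        if top_score <= best_score:
            break  # no neighbor improves on the current config
        best, best_score = top_cfg, top_score
    return best, best_score


def toy_throughput(cfg: Config) -> float:
    """Synthetic stand-in for a locally observed throughput metric (MB/s)."""
    return (1000.0
            - abs(cfg["max_rpcs_in_flight"] - 32) * 5
            - abs(cfg["max_dirty_mb"] - 128))


start = {"max_rpcs_in_flight": 8, "max_dirty_mb": 32}
best_cfg, best_mbps = tune(toy_throughput, start)
print(best_cfg, best_mbps)
```

In a real deployment, `measure` would apply the configuration on the client (for example with `lctl set_param`) and read local client statistics over a short window; that part is omitted here. The sketch only illustrates that each client can search its own parameter space without global system information.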