CARAT: Client-Side Adaptive RPC and Cache Co-Tuning for Parallel File Systems

📅 2026-02-25
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This work addresses the challenge of efficiently tuning parallel file systems in high-performance computing environments, where complex I/O paths, diverse access patterns, and dynamically changing system states hinder optimal performance. To this end, we propose CARAT, a lightweight and scalable framework that enables client-side, online, self-adaptive tuning without relying on global system information or predefined I/O patterns. CARAT uses locally observable metrics and a machine-learning-guided adaptive algorithm to jointly optimize RPC and caching parameters on Lustre clients, responding dynamically to changes in application I/O behavior and system conditions. Evaluated under diverse dynamic I/O workloads and real-world HPC applications, CARAT achieves up to a 3× performance improvement over the default or static configurations, demonstrating its effectiveness and robustness.
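Jointly optimizing RPC and caching parameters means searching their combined space rather than tuning one knob at a time. The sketch below enumerates such a joint space and renders each candidate as `lctl set_param` command strings. The four tunables are real Lustre client parameters, but the summary does not say which parameters CARAT actually tunes; this particular set, the value grids, and the helper names (`SEARCH_SPACE`, `candidate_configs`, `to_lctl_commands`) are hypothetical and only for illustration.

```python
from itertools import product

# Hypothetical search space: these Lustre client tunables exist, but whether
# CARAT tunes exactly these, and over these value ranges, is an assumption.
SEARCH_SPACE = {
    "osc.*.max_rpcs_in_flight": [8, 16, 32],  # concurrent RPCs per OSC
    "osc.*.max_pages_per_rpc": [256, 1024],   # RPC size, in pages
    "osc.*.max_dirty_mb": [32, 128],          # per-OSC dirty write cache (MiB)
    "llite.*.max_read_ahead_mb": [64, 256],   # client read-ahead cache (MiB)
}


def candidate_configs(space):
    """Yield every combination of values as a dict of parameter -> value."""
    names = list(space)
    for values in product(*(space[n] for n in names)):
        yield dict(zip(names, values))


def to_lctl_commands(config):
    """Render one configuration as `lctl set_param` command strings.

    The commands are only built here, not executed; applying them would
    require privileges on a Lustre client node.
    """
    return [f"lctl set_param {name}={value}" for name, value in config.items()]


configs = list(candidate_configs(SEARCH_SPACE))
print(len(configs))  # 24 (= 3 * 2 * 2 * 2 joint settings)
for cmd in to_lctl_commands(configs[0]):
    print(cmd)
```

Even this small grid yields 24 joint settings, and each additional knob multiplies the count, so exhaustively measuring every setting online quickly becomes impractical.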

๐Ÿ“ Abstract
Tuning parallel file systems (PFS) in High-Performance Computing (HPC) systems remains challenging due to complex I/O paths, diverse I/O patterns, and dynamic system conditions. While existing autotuning frameworks have shown promising results in tuning PFS parameters based on applications' I/O patterns, they lack scalability, adaptivity, and the ability to operate online. In this work, focusing on scalable online tuning, we present CARAT, an ML-guided framework that co-tunes the client-side RPC and caching parameters of a PFS using only locally observable metrics. Unlike global or pattern-dependent approaches, CARAT enables each client to make independent, intelligent tuning decisions online, responding to real-time changes in both application I/O behavior and system state. We prototyped CARAT on Lustre and evaluated it extensively across dynamic I/O patterns, real-world HPC workloads, and multi-client deployments. The results demonstrate that CARAT achieves up to a 3× performance improvement over the default or static configurations, validating the effectiveness and generality of our approach. Because it is scalable and lightweight, we believe CARAT has the potential to be widely deployed in existing PFS and to benefit a variety of data-intensive applications.
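To make "independent, online decisions from locally observable metrics" concrete, here is a minimal per-client loop under stated assumptions. It is not CARAT's ML-guided algorithm: it swaps in a simple greedy bandit with periodic round-robin exploration, and it tracks each configuration's reward with an exponential moving average so stale measurements fade when the workload shifts. The class `LocalOnlineTuner`, the function `simulated_throughput`, and the phase change at step 200 are all hypothetical; configuration indices 0 to 2 stand for any three candidate RPC/cache settings, and the synthetic reward stands in for a client's measured throughput.

```python
class LocalOnlineTuner:
    """Greedy selection with periodic round-robin exploration (hypothetical).

    Uses only a locally observed reward, e.g. this client's throughput.
    Estimates are exponential moving averages so old samples fade.
    """

    def __init__(self, n_configs, alpha=0.3, explore_every=10):
        self.estimates = [0.0] * n_configs
        self.tried = [False] * n_configs
        self.alpha = alpha                  # EMA step: higher = faster adaptation
        self.explore_every = explore_every  # explore once every N steps

    def choose(self, step):
        for i, tried in enumerate(self.tried):
            if not tried:
                return i  # measure every configuration once first
        if step % self.explore_every == 0:
            # Periodic exploration cycles through configurations in order.
            return (step // self.explore_every) % len(self.estimates)
        # Otherwise exploit the configuration with the best current estimate.
        return max(range(len(self.estimates)), key=lambda j: self.estimates[j])

    def update(self, i, reward):
        if not self.tried[i]:
            self.estimates[i] = reward
            self.tried[i] = True
        else:
            self.estimates[i] += self.alpha * (reward - self.estimates[i])


def simulated_throughput(config, step):
    """Hypothetical stand-in for a measured metric: config 1 is best at
    first, config 2 is best after the workload changes at step 200."""
    best = 1 if step < 200 else 2
    return 100.0 if config == best else 40.0


tuner = LocalOnlineTuner(n_configs=3)
choices = []
for step in range(400):
    c = tuner.choose(step)
    tuner.update(c, simulated_throughput(c, step))
    choices.append(c)

print(choices[100:200].count(1), choices[300:400].count(2))  # 94 93
```

The constant EMA step size is the design choice that matters here: after the shift at step 200, config 1's estimate decays within a few samples and the tuner settles on config 2 by step 205, whereas a plain sample average over 200 earlier measurements would keep favoring the stale choice much longer.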
Problem

parallel file systems
autotuning
HPC
I/O optimization
scalability
Innovation

adaptive tuning
client-side co-tuning
parallel file systems
online optimization
machine learning-guided