🤖 AI Summary
In shared HPC environments, storage resource contention induces I/O congestion, causing performance volatility, task delays, and timeouts. To address this, we propose a control-theoretic, adaptive client-side I/O rate regulation method—the first to embed feedback control into the client scheduling layer of shared storage systems. Our approach relies solely on lightweight, runtime load metrics (e.g., I/O latency, queue length), enabling parameter-free, workload-agnostic congestion control without manual tuning. Evaluation on multi-node clusters demonstrates up to 20% reduction in total execution time, substantial tail-latency improvement, and significantly enhanced system stability and predictability over baseline schedulers. The core contribution is the first closed-loop control framework specifically designed for I/O congestion in shared storage systems—rigorously grounded in control theory yet practical for real-world deployment.
📝 Abstract
Efficient data access in High-Performance Computing (HPC) systems is essential to the performance of intensive computing tasks. Traditional optimizations of the I/O stack aim to improve peak performance but are often workload specific and require deep expertise, making them difficult to generalize or re-use. In shared HPC environments, resource congestion can lead to unpredictable performance, causing slowdowns and timeouts. To address these challenges, we propose a self-adaptive approach based on Control Theory to dynamically regulate client-side I/O rates. Our approach leverages a small set of runtime system load metrics to reduce congestion and enhance performance stability. We implement a controller in a multi-node cluster and evaluate it on a real testbed under a representative workload. Experimental results demonstrate that our method effectively mitigates I/O congestion, reducing total runtime by up to 20% and lowering tail latency, while maintaining stable performance.