Mitigating Shared Storage Congestion Using Control Theory

📅 2025-11-20

📈 Citations: 0

✨ Influential: 0

career value

259K/year

🤖 AI Summary

In shared HPC environments, storage resource contention induces I/O congestion, causing performance volatility, task delays, and timeouts. To address this, we propose a control-theoretic, adaptive client-side I/O rate regulation method—the first to embed feedback control into the client scheduling layer of shared storage systems. Our approach relies solely on lightweight, runtime load metrics (e.g., I/O latency, queue length), enabling parameter-free, workload-agnostic congestion control without manual tuning. Evaluation on multi-node clusters demonstrates up to 20% reduction in total execution time, substantial tail-latency improvement, and significantly enhanced system stability and predictability over baseline schedulers. The core contribution is the first closed-loop control framework specifically designed for I/O congestion in shared storage systems—rigorously grounded in control theory yet practical for real-world deployment.

Technology Category

Application Category

📝 Abstract

Efficient data access in High-Performance Computing (HPC) systems is essential to the performance of intensive computing tasks. Traditional optimizations of the I/O stack aim to improve peak performance but are often workload specific and require deep expertise, making them difficult to generalize or re-use. In shared HPC environments, resource congestion can lead to unpredictable performance, causing slowdowns and timeouts. To address these challenges, we propose a self-adaptive approach based on Control Theory to dynamically regulate client-side I/O rates. Our approach leverages a small set of runtime system load metrics to reduce congestion and enhance performance stability. We implement a controller in a multi-node cluster and evaluate it on a real testbed under a representative workload. Experimental results demonstrate that our method effectively mitigates I/O congestion, reducing total runtime by up to 20% and lowering tail latency, while maintaining stable performance.

Problem

Research questions and friction points this paper is trying to address.

Mitigating shared storage congestion in HPC systems

Dynamically regulating client-side I/O rates using control theory

Reducing runtime and tail latency through adaptive control

Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses control theory for dynamic I/O rate regulation

Leverages runtime system load metrics to reduce congestion

Implements controller in cluster to mitigate storage congestion

🔎 Similar Papers

A Survey on Adversarial Contention Resolution

2024-03-06arXiv.orgCitations: 0

💼 Related Jobs

AI/HPC System Performance Engineer