Mitigating Shared Storage Congestion Using Control Theory

📅 2025-11-20
🤖 AI Summary
In shared HPC environments, contention for storage resources induces I/O congestion, which causes performance volatility, task delays, and timeouts. To address this, we propose an adaptive, control-theoretic method that regulates client-side I/O rates by embedding feedback control in the client scheduling layer of a shared storage system. The approach relies solely on lightweight runtime load metrics (e.g., I/O latency, queue length), enabling workload-agnostic congestion control without manual parameter tuning. Evaluation on a multi-node cluster shows up to a 20% reduction in total execution time, lower tail latency, and more stable, predictable performance than baseline schedulers. The core contribution is presented as the first closed-loop control framework designed specifically for I/O congestion in shared storage systems, grounded in control theory yet practical for real-world deployment.

📝 Abstract
Efficient data access in High-Performance Computing (HPC) systems is essential to the performance of intensive computing tasks. Traditional optimizations of the I/O stack aim to improve peak performance but are often workload specific and require deep expertise, making them difficult to generalize or re-use. In shared HPC environments, resource congestion can lead to unpredictable performance, causing slowdowns and timeouts. To address these challenges, we propose a self-adaptive approach based on Control Theory to dynamically regulate client-side I/O rates. Our approach leverages a small set of runtime system load metrics to reduce congestion and enhance performance stability. We implement a controller in a multi-node cluster and evaluate it on a real testbed under a representative workload. Experimental results demonstrate that our method effectively mitigates I/O congestion, reducing total runtime by up to 20% and lowering tail latency, while maintaining stable performance.
Problem

Research questions and friction points this paper is trying to address.

Mitigating shared storage congestion in HPC systems
Dynamically regulating client-side I/O rates using control theory
Reducing runtime and tail latency through adaptive control
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses control theory for dynamic I/O rate regulation
Leverages runtime system load metrics to reduce congestion
Implements controller in cluster to mitigate storage congestion
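
The closed-loop idea in the list above can be illustrated with a minimal sketch. The paper's actual controller design, metrics, and gains are not given on this page, so everything below is an assumption: a discrete-time PI controller (the hypothetical `PIController`), a toy first-order model of the storage load, a load setpoint of 50, and rate limits of 0 to 1000 in arbitrary units. The controller adjusts one client's I/O rate limit so that the measured load tracks the setpoint, and it backs off when background traffic from other clients appears at step 100.

```python
from dataclasses import dataclass


@dataclass
class PIController:
    """Discrete-time PI controller with output clamping.

    Gains and limits are illustrative assumptions, not values from the paper.
    """
    kp: float
    ki: float
    out_min: float
    out_max: float
    integral: float = 0.0

    def update(self, setpoint: float, measurement: float, dt: float = 1.0) -> float:
        error = setpoint - measurement
        candidate = self.integral + error * dt
        output = self.kp * error + self.ki * candidate
        # Simple anti-windup: keep the new integral only if the output is not saturated.
        if self.out_min <= output <= self.out_max:
            self.integral = candidate
        return min(max(output, self.out_min), self.out_max)


def simulate(steps: int = 300, setpoint: float = 50.0):
    """Run the controller against a toy storage model.

    Assumed model: load[k+1] = 0.8 * load[k] + 0.05 * (rate[k] + background[k]),
    where background traffic from other clients starts at step 100.
    Returns a list of (client_rate, load) pairs, one per step.
    """
    ctrl = PIController(kp=0.5, ki=0.5, out_min=0.0, out_max=1000.0)
    load = 0.0
    history = []
    for k in range(steps):
        rate = ctrl.update(setpoint, load)        # new client-side rate limit
        background = 0.0 if k < 100 else 100.0    # other clients join at step 100
        load = 0.8 * load + 0.05 * (rate + background)
        history.append((rate, load))
    return history


if __name__ == "__main__":
    for step, (rate, load) in enumerate(simulate()):
        if step % 50 == 49:
            print(f"step {step + 1}: rate={rate:.1f} load={load:.1f}")
```

In this toy model the client rate settles near 200 while it has the storage to itself, then drops to about 100 once the background traffic appears, and the load returns to the setpoint in both phases. A real deployment would replace the model with measured metrics and would need gains chosen for the actual system.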
Authors

Thomas Collignon
Qarnot Computing, Univ. Lille, Inria, CNRS, Centrale Lille, UMR 9189 CRIStAL
Kouds Halitim
Univ. Grenoble Alpes, Inria, CNRS, LIG
Raphaël Bleuse
Univ. Grenoble Alpes, Inria, CNRS, LIG
Sophie Cerf
Inria
Control theory, autonomic computing, distributed systems
Bogdan Robu
Univ. Grenoble Alpes, CNRS, Grenoble INP, GIPSA-lab
Éric Rutten
Univ. Grenoble Alpes, Inria, CNRS, LIG
Lionel Seinturier
Univ. Lille, Inria, CNRS, Centrale Lille, UMR 9189 CRIStAL
Alexandre van Kempen
Qarnot Computing