IOAgent: Democratizing Trustworthy HPC I/O Performance Diagnosis Capability via LLMs

📅 2026-02-25

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the growing complexity of high-performance computing (HPC) storage stacks, which hinders domain scientists from independently diagnosing I/O performance bottlenecks and forces reliance on scarce I/O experts. To bridge this gap, we present the first integration of large language models (LLMs) into HPC I/O performance diagnosis, introducing a modular architecture that parses Darshan trace files via a preprocessing module, augments domain knowledge through retrieval-augmented generation (RAG), and employs a tree-based merging strategy to produce interpretable and interactive diagnostic feedback. Our approach effectively mitigates challenges related to long-context processing, insufficient domain knowledge, and hallucination, while remaining compatible with both open- and closed-source LLMs. We also release TraceBench, the first open-source benchmark for HPC I/O diagnostics, and demonstrate through experiments that our system matches or exceeds existing tools in accuracy and usability, significantly empowering scientists to conduct autonomous performance analysis.

📝 Abstract

As the complexity of the HPC storage stack rapidly grows, domain scientists face increasing challenges in effectively utilizing HPC storage systems to achieve their desired I/O performance. To identify and address I/O issues, scientists largely rely on I/O experts to analyze their I/O traces and provide insights into potential problems. However, with a limited number of I/O experts and the growing demand for data-intensive applications, inaccessibility has become a major bottleneck, hindering scientists from maximizing their productivity. Rapid advances in LLMs make it possible to build an automated tool that brings trustworthy I/O performance diagnosis to domain scientists. However, key challenges remain, such as the inability to handle long context windows, a lack of accurate domain knowledge about HPC I/O, and the generation of hallucinations during complex interactions. In this work, we propose IOAgent as a systematic effort to address these challenges. IOAgent integrates a module-based pre-processor, a RAG-based domain knowledge integrator, and a tree-based merger to accurately diagnose I/O issues from a given Darshan trace file. Similar to an I/O expert, IOAgent provides detailed justifications and references for its diagnoses and offers an interactive interface for scientists to ask targeted follow-up questions. To evaluate IOAgent, we collected a diverse set of labeled job traces and released the first open diagnosis test suite, TraceBench. Using this test suite, we conducted extensive evaluations, demonstrating that IOAgent matches or outperforms state-of-the-art I/O diagnosis tools with accurate and useful diagnosis results. We also show that IOAgent is not tied to specific LLMs, performing similarly well with both proprietary and open-source LLMs. We believe IOAgent has the potential to become a powerful tool for scientists navigating complex HPC I/O subsystems in the future.
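
The abstract names a tree-based merger but gives no details on this page, so the following is only a minimal sketch of one way such a merger could work, not IOAgent's actual implementation. It assumes each trace module has already been turned into a short text summary, and it combines summaries pairwise, level by level, so that no single merge step has to see the whole trace at once. The names `tree_merge` and `concat_merge` are hypothetical; in a real system the `merge_pair` callback would presumably be an LLM call, which is replaced here by plain string concatenation so the example runs on its own.

```python
from typing import Callable, List


def tree_merge(summaries: List[str], merge_pair: Callable[[str, str], str]) -> str:
    """Reduce per-module summaries to one summary by pairwise merging.

    Hypothetical helper: merges neighbours level by level, like a binary
    reduction tree, until a single summary remains.
    """
    if not summaries:
        raise ValueError("need at least one summary")
    level = list(summaries)
    while len(level) > 1:
        next_level = []
        # Merge neighbours (0,1), (2,3), ... on the current level.
        for i in range(0, len(level) - 1, 2):
            next_level.append(merge_pair(level[i], level[i + 1]))
        # With an odd count, the last summary moves up unchanged.
        if len(level) % 2 == 1:
            next_level.append(level[-1])
        level = next_level
    return level[0]


def concat_merge(left: str, right: str) -> str:
    """Stand-in for an LLM merge step (assumption): just joins two strings."""
    return f"({left} + {right})"


if __name__ == "__main__":
    # Illustrative module labels only; real inputs would be text summaries.
    modules = ["POSIX", "MPI-IO", "STDIO", "Lustre", "HDF5"]
    print(tree_merge(modules, concat_merge))
```

With n summaries, each merge step sees only two inputs and the tree has about log2(n) levels, which is the usual motivation for this shape when the full input would not fit in one context window.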

Problem

Research questions and friction points this paper is trying to address.

HPC I/O performance

performance diagnosis

domain scientists

storage stack complexity

expert bottleneck

Innovation

Methods, ideas, or system contributions that make the work stand out.

LLM-based diagnosis

HPC I/O performance

Retrieval-Augmented Generation (RAG)