Parallel Data Object Creation: Towards Scalable Metadata Management in High-Performance I/O Library

📅 2025-06-18
🤖 AI Summary
Problem: Parallel I/O libraries such as PnetCDF hit a severe bottleneck when creating very large numbers of data objects, because object creation must be called collectively with metadata that is consistent across all processes. Method: Using PnetCDF as the experimental platform, this paper investigates solutions that stay within the netCDF file format and also proposes a new two-section file header, made up of an index table and a list of metadata blocks, that lets processes create data objects independently. The index table holds references to the metadata blocks, and each block stores the metadata of objects created either collectively or independently. Contribution/Results: In experiments with 4,096 MPI processes creating 5,684,800 data objects in parallel, the new design cuts data object creation time by up to 582×. The memory each process needs is inversely proportional to the number of processes, which improves scalability for applications that store many differently sized data objects.

📝 Abstract
High-level I/O libraries, such as HDF5 and PnetCDF, are commonly used by large-scale scientific applications to perform I/O tasks in parallel. These I/O libraries store metadata, such as data types and dimensionality, along with the raw data in the same files. While these libraries are well optimized for concurrent access to the raw data, they are designed neither to handle a large number of data objects efficiently nor to let multiple processes create different data objects independently, as they require applications to call data object creation APIs collectively with consistent metadata among all processes. Applications that process data gathered from remote sensors, such as particle collision experiments in high-energy physics, may generate data of different sizes from different sensors and want to store them as separate data objects. For such applications, the I/O library's requirement of collective data object creation can become very expensive, as the cost of the metadata consistency check increases with the metadata volume as well as the number of processes. To address this limitation, using PnetCDF as an experimental platform, we investigate solutions that abide by the netCDF file format and propose a new file header format that enables independent data object creation. The proposed file header consists of two sections: an index table and a list of metadata blocks. The index table contains references to the metadata blocks, and each block stores the metadata of objects that can be created collectively or independently. The new design achieves scalable performance, cutting data object creation times by up to 582x when running on 4096 MPI processes to create 5,684,800 data objects in parallel. Additionally, the new method reduces the memory footprint, with each process requiring an amount of memory inversely proportional to the number of processes.
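The two-section header described in the abstract can be pictured with a small in-memory sketch. Everything below is an assumption made for illustration: the byte layout, the `(offset, size)` index entries, and the helper names `pack_block`, `unpack_block`, `build_header`, and `read_block` are hypothetical and are not the paper's on-disk format or any PnetCDF API. The sketch only shows the general idea that each process can serialize its own metadata block without seeing anyone else's, and that a reader reaches a block through the index table.

```python
import struct

# Hypothetical layout, for illustration only; not the paper's on-disk format.
# One index entry = (block offset, block size), both little-endian uint64.
INDEX_ENTRY = struct.Struct("<QQ")


def pack_block(objects):
    """Serialize one process's metadata: a list of (name, dims) pairs."""
    out = [struct.pack("<I", len(objects))]
    for name, dims in objects:
        encoded = name.encode("utf-8")
        out.append(struct.pack("<H", len(encoded)) + encoded)
        out.append(struct.pack("<H", len(dims)))
        out.append(struct.pack(f"<{len(dims)}Q", *dims))
    return b"".join(out)


def unpack_block(buf):
    """Inverse of pack_block."""
    (count,) = struct.unpack_from("<I", buf, 0)
    pos = 4
    objects = []
    for _ in range(count):
        (name_len,) = struct.unpack_from("<H", buf, pos)
        pos += 2
        name = buf[pos:pos + name_len].decode("utf-8")
        pos += name_len
        (ndims,) = struct.unpack_from("<H", buf, pos)
        pos += 2
        dims = struct.unpack_from(f"<{ndims}Q", buf, pos)
        pos += 8 * ndims
        objects.append((name, tuple(dims)))
    return objects


def build_header(blocks):
    """Lay out the index table first, then the metadata blocks it points to."""
    table_size = 4 + INDEX_ENTRY.size * len(blocks)
    index = [struct.pack("<I", len(blocks))]
    offset = table_size
    for block in blocks:
        index.append(INDEX_ENTRY.pack(offset, len(block)))
        offset += len(block)
    return b"".join(index) + b"".join(blocks)


def read_block(header, rank):
    """Find one block through the index table and decode it."""
    offset, size = INDEX_ENTRY.unpack_from(header, 4 + INDEX_ENTRY.size * rank)
    return unpack_block(header[offset:offset + size])


# Each simulated process describes its own objects without consulting the others.
per_rank = [
    [("sensor0/hits", (128,)), ("sensor0/energy", (128, 4))],
    [("sensor1/hits", (37,))],
    [],
]
header = build_header([pack_block(objs) for objs in per_rank])
print(read_block(header, 1))
```

In this sketch only the block sizes are needed to build the index table, so the blocks themselves never have to be compared across processes. Whether the actual library computes offsets this way is not stated in the abstract.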
Problem

Research questions and friction points this paper is trying to address.

Enables independent parallel creation of numerous data objects
Reduces metadata consistency check costs in high-performance I/O
Improves scalability and memory efficiency for large-scale applications
Innovation

Methods, ideas, or system contributions that make the work stand out.

New file header format enables independent object creation
Index table and metadata blocks improve scalability
Reduces creation time and memory footprint significantly
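To make the memory claim concrete, here is a rough back-of-the-envelope calculation. It assumes, purely for illustration, that the objects are spread evenly and that each process keeps only the metadata of the objects it created; the abstract states the inverse proportionality but not these details. The helper name `objects_per_process` is hypothetical.

```python
def objects_per_process(total_objects, num_processes):
    """Ceiling of total_objects / num_processes: the largest share any process holds."""
    return -(-total_objects // num_processes)


TOTAL_OBJECTS = 5_684_800  # object count quoted in the abstract

for nprocs in (256, 1024, 4096):
    share = objects_per_process(TOTAL_OBJECTS, nprocs)
    print(f"{nprocs:5d} processes: at most {share:,} objects per process")
```

Under the same assumptions, a design in which every process holds metadata for all objects would keep all 5,684,800 entries on each process regardless of the process count.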
Authors
Youjia Li
Department of Electrical and Computer Engineering, Northwestern University
Robert Latham
Argonne National Laboratory
Robert B. Ross
Senior Computer Scientist, Mathematics and Computer Science Division, Argonne National Laboratory
high-performance computing, HPC system software, parallel I/O and file systems
Ankit Agrawal
Department of Electrical and Computer Engineering, Northwestern University
Alok N. Choudhary
Department of Electrical and Computer Engineering, Northwestern University
Wei-Keng Liao
Department of Electrical and Computer Engineering, Northwestern University