🤖 AI Summary
This study addresses the insufficient AI-readiness of cross-domain scientific data (spanning climate, nuclear fusion, bio/health, and materials science) for foundation model training. We propose a two-dimensional AI-readiness framework tailored to high-performance computing (HPC) environments. The framework combines *readiness maturity levels* with *data processing stages* to form a standardized maturity matrix that characterizes the transformation pathway from raw data to AI-ready data. By analyzing common preprocessing patterns and domain-specific constraints across the four domains, we relate HPC-oriented preprocessing techniques, such as scalable I/O, distributed format conversion, and memory-efficient feature extraction, to the training of generative models (e.g., Transformers). Our contribution is a conceptual, cross-domain methodology for scientific data preparation that guides infrastructure development toward systematic, scalable, reproducible, and domain-aware data readiness.
📝 Abstract
This paper examines how Data Readiness for AI (DRAI) principles apply to leadership-scale scientific datasets used to train foundation models. We analyze archetypal workflows across four representative domains (climate, nuclear fusion, bio/health, and materials) to identify common preprocessing patterns and domain-specific constraints. We introduce a two-dimensional readiness framework composed of Data Readiness Levels (raw to AI-ready) and Data Processing Stages (ingest to shard), both tailored to high-performance computing (HPC) environments. The framework outlines the key challenges in transforming scientific data for scalable AI training, with an emphasis on transformer-based generative models. Together, the two dimensions form a conceptual maturity matrix that characterizes scientific data readiness and guides infrastructure development toward standardized, cross-domain support for scalable and reproducible AI for science.
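As an illustration only, the two-dimensional matrix described above might be represented as a small data structure that records, for each processing stage, the readiness level a dataset has reached. The abstract names only the endpoints of each dimension (raw and AI-ready; ingest and shard), so the intermediate entries below are hypothetical placeholders. The class name `MaturityMatrix`, its methods, and the rule that overall readiness equals the weakest stage are likewise assumptions of this sketch, not definitions from the paper.

```python
from dataclasses import dataclass, field

# Only the first and last entry of each list come from the abstract.
# The intermediate entries are hypothetical placeholders.
READINESS_LEVELS = ["raw", "level-2 (placeholder)", "level-3 (placeholder)", "AI-ready"]
PROCESSING_STAGES = ["ingest", "stage-2 (placeholder)", "stage-3 (placeholder)", "shard"]


@dataclass
class MaturityMatrix:
    """Records the readiness level reached at each processing stage for one dataset."""

    dataset: str
    cells: dict = field(default_factory=dict)  # maps stage name to level name

    def record(self, stage: str, level: str) -> None:
        """Store the readiness level reached at a given stage."""
        if stage not in PROCESSING_STAGES:
            raise ValueError(f"unknown stage: {stage!r}")
        if level not in READINESS_LEVELS:
            raise ValueError(f"unknown level: {level!r}")
        self.cells[stage] = level

    def overall_level(self) -> str:
        """Return the weakest level across all stages.

        Assumption of this sketch: a dataset is only as ready as its least
        ready stage, and stages with no record count as the lowest level.
        """
        lowest = READINESS_LEVELS[0]
        indices = [
            READINESS_LEVELS.index(self.cells.get(stage, lowest))
            for stage in PROCESSING_STAGES
        ]
        return READINESS_LEVELS[min(indices)]


if __name__ == "__main__":
    matrix = MaturityMatrix(dataset="example-climate-dataset")  # hypothetical name
    matrix.record("ingest", "AI-ready")
    matrix.record("shard", "level-2 (placeholder)")
    print(matrix.overall_level())
```

Taking the minimum across stages is one simple aggregation choice; a per-stage report may be more informative when different domains stall at different stages.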