Intelligent Sampling of Extreme-Scale Turbulence Datasets for Accurate and Efficient Spatiotemporal Model Training

📅 2025-08-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
With the breakdown of Moore's Law and Dennard scaling, the high dimensionality of extreme-scale turbulent flow data severely impedes both training efficiency and model accuracy. To address this, the authors propose SICKLE, a sparse intelligent curation framework for efficient learning, which introduces a maximum entropy (MaxEnt) subsampling strategy in place of conventional random or phase-space sampling. This strategy substantially reduces training data volume while improving model generalization. SICKLE integrates three core components: sparse data selection, a scalable distributed training architecture, and fine-grained energy-consumption benchmarking. Evaluated on large-scale direct numerical simulation (DNS) turbulence datasets on the Frontier exascale supercomputer, MaxEnt subsampling improves model accuracy by up to 12.7% and reduces training energy consumption by as much as 38×. The authors present this as the first empirical validation that intelligent subsampling at the preprocessing stage can jointly optimize predictive accuracy and computational energy efficiency.

📝 Abstract
With the end of Moore's law and Dennard scaling, efficient training increasingly requires rethinking data volume. Can we train better models with significantly less data via intelligent subsampling? To explore this, we develop SICKLE, a sparse intelligent curation framework for efficient learning, featuring a novel maximum entropy (MaxEnt) sampling approach, scalable training, and energy benchmarking. We compare MaxEnt with random and phase-space sampling on large direct numerical simulation (DNS) datasets of turbulence. Evaluating SICKLE at scale on Frontier, we show that subsampling as a preprocessing step can improve model accuracy and substantially lower energy consumption, with reductions of up to 38x observed in certain cases.
Problem

Research questions and friction points this paper is trying to address.

Intelligent subsampling for efficient turbulence model training
MaxEnt sampling vs random/phase-space on DNS datasets
Reducing energy use in training via preprocessing techniques
Innovation

Methods, ideas, or system contributions that make the work stand out.

Maximum entropy sampling for efficient learning
Scalable training on extreme-scale turbulence datasets
Energy benchmarking with significant consumption reduction
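To make the MaxEnt idea concrete: the goal is to pick a subset whose distribution covers rare events as well as the bulk of the flow, rather than mirroring the raw data's skew the way random sampling does. A minimal greedy sketch of this principle, for a 1-D field of flow values, is below. This is an illustration of maximum-entropy subsampling in general, not the paper's implementation; the function name, binning scheme, and greedy rule are assumptions made for the example.

```python
import numpy as np

def maxent_subsample(data, k, n_bins=32):
    """Greedy maximum-entropy subsampling (illustrative sketch, not SICKLE's code).

    Selects k indices from `data` (1-D array of flow-field values) so that
    the histogram of the selected subset is as flat as possible, i.e. its
    Shannon entropy is maximized. A flat histogram over-represents rare
    high-magnitude events relative to random sampling.
    """
    edges = np.linspace(data.min(), data.max(), n_bins + 1)
    # assign each sample to a histogram bin
    bin_of = np.clip(np.digitize(data, edges) - 1, 0, n_bins - 1)
    counts = np.zeros(n_bins)          # how many selections each bin has so far
    selected = []
    remaining = set(range(len(data)))
    for _ in range(k):
        # greedy step: take a sample from the currently least-selected bin,
        # which flattens the subset histogram and so raises its entropy
        best = min(remaining, key=lambda i: counts[bin_of[i]])
        counts[bin_of[best]] += 1
        selected.append(best)
        remaining.remove(best)
    return np.array(selected)
```

Random subsampling of a heavily skewed field would reproduce the skew; the greedy rule above instead spends the sample budget evenly across bins, which is the intuition behind preferring MaxEnt selection over random or phase-space sampling.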