🤖 AI Summary
Lossy compression often distorts topological features, such as critical points and Morse–Smale segmentations, thereby compromising the accuracy of scientific analysis. This work proposes the first distributed parallel algorithm that scales across multiple GPUs for correcting piecewise-linear Morse–Smale segmentations (PLMSS) after compression. By preserving the steepest ascending and descending directions at every point, the method avoids explicit integral path computation. Combined with relaxed synchronization, this minimizes interprocess communication while adding negligible storage overhead. Evaluated on real-world datasets on the Perlmutter supercomputer, the approach achieves over 90% parallel efficiency on 128 GPUs, extending PLMSS correction beyond the single-GPU limit to extreme-scale data.
📝 Abstract
Lossy compression, widely used by scientists to reduce data from simulations, experiments, and observations, can distort features of interest even under bounded error. Such distortions may compromise downstream analyses and lead to incorrect scientific conclusions in applications such as combustion and cosmology. This paper presents a distributed and parallel algorithm for correcting topological features, specifically piecewise-linear Morse–Smale segmentations (PLMSS), which decompose the domain into monotone regions labeled by their corresponding local minima and maxima. While a single-GPU algorithm (MSz) exists for PLMSS correction after compression, no methodology has been developed that scales beyond a single GPU for extreme-scale data. We identify the key bottleneck in scaling PLMSS correction as the parallel computation of integral paths, a communication-intensive computation that is notoriously difficult to scale. Instead of explicitly computing and correcting integral paths, our algorithm simplifies MSz by preserving the steepest ascending and descending directions at all locations, thereby minimizing interprocess communication while introducing negligible additional storage overhead. With this simplified algorithm and relaxed synchronization, our method achieves over 90% parallel efficiency on 128 GPUs on the Perlmutter supercomputer for real-world datasets.
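
To illustrate why preserving steepest directions can replace explicit integral path tracing, the following is a minimal single-process sketch, not the paper's GPU implementation. It assumes a 2D regular grid with a 6-neighbor (Freudenthal-style) stencil, takes the "steepest" neighbor to be the one with the largest (or smallest) value, and breaks ties by vertex index. The function names `steepest_neighbors`, `labels`, and `directions_preserved` are hypothetical. The point it shows is that if every vertex keeps the same steepest ascending and descending neighbor after decompression, the segmentation labels obtained by following those pointers are unchanged, and the check itself only reads a one-vertex neighborhood.

```python
import numpy as np

# Assumed 6-neighbor stencil on a 2D grid (illustrative choice, not from the paper).
OFFSETS = [(-1, 0), (1, 0), (0, -1), (0, 1), (-1, -1), (1, 1)]


def steepest_neighbors(f, ascending):
    """For each vertex, return the flat index of its steepest neighbor.

    A vertex that is a local extremum points to itself. Ties in value are
    broken by flat index, so the ordering of vertices is strict.
    """
    rows, cols = f.shape
    out = np.empty((rows, cols), dtype=np.int64)
    for i in range(rows):
        for j in range(cols):
            best = (f[i, j], i * cols + j)
            for di, dj in OFFSETS:
                ni, nj = i + di, j + dj
                if 0 <= ni < rows and 0 <= nj < cols:
                    cand = (f[ni, nj], ni * cols + nj)
                    better = cand > best if ascending else cand < best
                    if better:
                        best = cand
            out[i, j] = best[1]
    return out


def labels(pointers):
    """Follow steepest-neighbor pointers to their extremum via pointer jumping."""
    flat = pointers.ravel().copy()
    while True:
        nxt = flat[flat]  # each vertex jumps to its pointer's pointer
        if np.array_equal(nxt, flat):
            return flat.reshape(pointers.shape)
        flat = nxt


def directions_preserved(original, decompressed):
    """Boolean mask: True where both steepest directions match the original."""
    same_up = steepest_neighbors(original, True) == steepest_neighbors(decompressed, True)
    same_down = steepest_neighbors(original, False) == steepest_neighbors(decompressed, False)
    return same_up & same_down


if __name__ == "__main__":
    f = np.arange(12, dtype=float).reshape(3, 4)
    g = f.copy()
    g[0, 0] = 100.0  # a distortion that turns the global minimum into a maximum
    print(directions_preserved(f, g))
```

In this sketch, vertices where the mask is `False` are the ones a correction step would need to edit. Because the mask depends only on each vertex and its immediate neighbors, a distributed version would plausibly need only a thin ghost layer per subdomain, which is consistent with the abstract's claim of reduced interprocess communication; that reading is an interpretation, not a detail stated in the source.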