🤖 AI Summary
Problem: Existing post-hoc quality assessment (e.g., PSNR, L2 norm) for error-bounded lossy compression of scientific time-series data incurs high computational overhead and offers no real-time feedback.
Method: We propose the first general-purpose deep surrogate model tailored for time-series scientific data. Our approach employs a two-stage decoupled architecture that separates computationally expensive compression feature extraction from lightweight quality metric prediction. To enhance temporal robustness and generalization, we introduce a time-aware Mixture-of-Experts (MoE) mechanism. The model is trained end-to-end across multiple compressors, diverse quality metrics, and heterogeneous scientific datasets.
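As a rough illustration of the two ideas above, here is a minimal, hypothetical sketch of a time-aware mixture-of-experts head: each expert maps precomputed compression features to a quality metric, and a gate conditioned on the timestep mixes the experts. This is not the paper's actual architecture (which uses learned deep feature extractors); the linear experts, the timestep-only gate, and all parameter names are simplifying assumptions.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of gate scores.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def expert_predict(weights, bias, features):
    # One expert: a toy linear head from compression features
    # (e.g., error bound, value range) to a quality metric.
    return sum(w * f for w, f in zip(weights, features)) + bias

def moe_predict(experts, gate_params, timestep, features):
    # Time-aware gate: scores depend on the (normalized) timestep,
    # letting different experts specialize in different phases of a
    # time-evolving simulation.
    scores = [a * timestep + b for a, b in gate_params]
    gates = softmax(scores)
    preds = [expert_predict(w, b, features) for w, b in experts]
    # Final prediction is the gate-weighted mixture of expert outputs.
    return sum(g * p for g, p in zip(gates, preds))
```

For example, with two experts and a gate that strongly prefers the first expert at late timesteps, `moe_predict` returns essentially the first expert's output. The decoupling in the paper means `features` would be produced once by the expensive extraction stage and then reused across many such lightweight metric predictions.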
Contribution/Results: Evaluated on four real-world scientific applications, our model achieves prediction errors consistently below 10%, significantly outperforming state-of-the-art methods. It enables on-the-fly, demand-driven tuning of compression parameters, substantially reducing both I/O and computational overhead.
📝 Abstract
Error-bounded lossy compression techniques have become vital for scientific data management and analytics, given the ever-increasing volume of data generated by modern scientific simulations and instruments. Nevertheless, assessing data quality post-compression remains computationally expensive due to the intensive nature of metric calculations. In this work, we present a general-purpose deep-surrogate framework for lossy compression quality prediction (DeepCQ), with the following key contributions: 1) We develop a surrogate model for compression quality prediction that generalizes across different error-bounded lossy compressors, quality metrics, and input datasets; 2) We adopt a novel two-stage design that decouples the computationally expensive feature-extraction stage from the lightweight metric prediction, enabling efficient training and modular inference; 3) We optimize model performance on time-evolving data using a mixture-of-experts design. This design enhances robustness when predicting across simulation timesteps, especially when the training and test data exhibit significant variation. We validate the effectiveness of DeepCQ on four real-world scientific applications. Our results highlight the framework's exceptional predictive accuracy, with prediction errors generally under 10% across most settings, significantly outperforming existing methods. Our framework empowers scientific users to make informed decisions about data compression based on their preferred data quality, thereby significantly reducing I/O and computational overhead in scientific data analysis.
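To make the "informed decisions" use case concrete: given any fast quality predictor (such as the surrogate the abstract describes), a user could scan candidate error bounds and keep the most aggressive one whose predicted metric still meets a quality target. The sketch below is a hypothetical illustration, not the paper's API; the monotone toy PSNR model stands in for a real surrogate.

```python
import math

def select_error_bound(predict_metric, candidates, target):
    """Return the largest (most aggressive) error bound whose
    predicted quality metric still meets the user's target,
    or None if no candidate qualifies."""
    best = None
    for eb in sorted(candidates):  # ascending: larger bound = more loss
        if predict_metric(eb) >= target:
            best = eb  # most aggressive bound seen so far that passes
    return best

def toy_psnr(eb):
    # Toy stand-in predictor: PSNR falls as the error bound grows.
    # A real surrogate would replace this with a cheap model inference.
    return 100.0 - 20.0 * math.log10(eb / 1e-6)
```

Because surrogate inference is cheap relative to compressing and decompressing the data to measure the metric directly, this scan can run on the fly at each timestep, which is the source of the I/O and compute savings the abstract claims.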