🤖 AI Summary
Scientific simulations on HPC systems generate petabyte- to exabyte-scale data, creating severe I/O and network bottlenecks; existing DNN–traditional hybrid compression frameworks struggle to relieve them due to high memory-access overhead, elevated latency, and poor adaptability to dynamic workloads. Method: This paper proposes the first dataflow-aware hardware architecture for neural–traditional hybrid compression, featuring dataflow-driven scheduling, collaborative pipelining of DNN and traditional algorithms, low-overhead on-chip memory access, and modular parallel processing units. Contribution/Results: Evaluated across multiple datasets and hardware platforms, the architecture achieves 3.50×–96.07× speedups and 24.51×–520.68× energy-efficiency improvements over baseline methods, while demonstrating strong scalability and hardware friendliness.
📝 Abstract
Scientific simulation leveraging high-performance computing (HPC) systems is crucial for modeling complex systems and phenomena in fields such as astrophysics, climate science, and fluid dynamics, generating massive datasets that often reach petabyte to exabyte scales. However, managing these vast data volumes introduces significant I/O and network bottlenecks, limiting practical performance and scalability. Cutting-edge lossy compression frameworks powered by deep neural networks (DNNs) have demonstrated superior compression ratios by capturing complex data correlations, yet their integration into HPC workflows poses substantial challenges: the hybrid non-neural and neural computation patterns cause excessive memory-access overhead, long sequential stalls, and limited adaptability to varying data sizes and workloads on existing hardware platforms. To overcome these challenges and push the limits of high-performance scientific computing, we propose FLARE, the first dataflow-aware and scalable hardware architecture for neural-hybrid scientific lossy compression. FLARE minimizes off-chip data access, reduces bubble overhead through efficient dataflow, and adopts a modular design that provides both scalability and flexibility, significantly enhancing throughput and energy efficiency on modern HPC systems. In particular, FLARE achieves runtime speedups ranging from $3.50\times$ to $96.07\times$, and energy-efficiency improvements ranging from $24.51\times$ to $520.68\times$, across various datasets and hardware platforms.