🤖 AI Summary
In resource-disaggregated HPC systems, remote memory access incurs substantial latency, necessitating efficient quantification of an application's sensitivity to memory latency and its memory-level parallelism (MLP). Existing approaches rely on custom hardware or cycle-accurate simulation, suffering from poor flexibility and high overhead. This paper introduces the first lightweight, runtime instruction-trace-based framework that constructs an execution directed acyclic graph (DAG) and—uniquely—integrates DAG critical-path analysis with memory access pattern modeling to theoretically bound latency sensitivity and MLP. The framework is portable across diverse hardware configurations. Evaluated on PolyBench, HPCG, and LULESH, it achieves prediction errors under 8% for performance bounds while accelerating analysis by three orders of magnitude compared to cycle-accurate simulation.
📝 Abstract
Resource disaggregation is a promising technique for improving the efficiency of large-scale computing systems. However, it comes at the cost of increased memory access latency, since data must traverse the network fabric between remote nodes. It is therefore crucial to ascertain an application's memory latency sensitivity in order to minimize the overall performance impact. Existing tools for measuring memory latency sensitivity often rely on custom hardware or cycle-accurate simulators, which can be inflexible and time-consuming. To address this, we present EDAN (Execution DAG Analyzer), a novel performance analysis tool that leverages an application's runtime instruction trace to generate its corresponding execution DAG. This approach allows us to estimate the latency sensitivity of sequential programs and to investigate the impact of different hardware configurations. EDAN not only computes theoretical bounds on performance metrics, but also provides insight into the memory-level parallelism inherent in HPC applications. We apply EDAN to benchmarks and applications such as PolyBench, HPCG, and LULESH to unveil their intrinsic memory-level parallelism and latency sensitivity.
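To give a feel for the kind of analysis the abstract describes, the sketch below builds a toy execution DAG from a handful of instructions and runs a critical-path computation over it. The node names, latencies, dependency edges, and the MLP estimate are all illustrative assumptions for this example, not EDAN's actual trace format or algorithm: independent loads can overlap (raising MLP), while a dependent "pointer-chase" load serializes on the critical path (raising latency sensitivity).

```python
# Minimal sketch of execution-DAG critical-path analysis.
# All names, latencies, and the MLP formula are illustrative
# assumptions, not EDAN's actual implementation.

# Each node: (latency_cycles, is_memory_op); `deps` maps a node to the
# nodes it depends on (its data-dependency predecessors).
nodes = {
    "ld_a": (100, True),   # remote load
    "ld_b": (100, True),   # independent remote load, overlaps ld_a
    "add":  (1,   False),  # consumes both loads
    "ld_c": (100, True),   # address depends on add (pointer chase)
    "mul":  (1,   False),
}
deps = {"add": ["ld_a", "ld_b"], "ld_c": ["add"], "mul": ["ld_c"]}

def finish_times(nodes, deps):
    """Earliest finish time of each node assuming unlimited parallelism."""
    finish = {}
    def t(n):
        if n not in finish:
            finish[n] = nodes[n][0] + max(
                (t(p) for p in deps.get(n, [])), default=0)
        return finish[n]
    for n in nodes:
        t(n)
    return finish

finish = finish_times(nodes, deps)
critical_path_len = max(finish.values())  # lower bound on runtime

# Memory latency serialized on the critical path: walk the path
# backwards, always following the latest-finishing predecessor.
n, serial_mem = max(finish, key=finish.get), 0
while n is not None:
    lat, is_mem = nodes[n]
    serial_mem += lat if is_mem else 0
    preds = deps.get(n, [])
    n = max(preds, key=lambda p: finish[p]) if preds else None

# Crude MLP estimate: total memory latency over serialized memory
# latency. Here 300 / 200 = 1.5, since ld_a and ld_b overlap.
total_mem = sum(lat for lat, is_mem in nodes.values() if is_mem)
mlp = total_mem / serial_mem

print(critical_path_len, mlp)  # 202 1.5
```

In this toy DAG, doubling the remote-load latency roughly doubles the critical path (high latency sensitivity), whereas a DAG of many independent loads would absorb the extra latency through overlap — the distinction EDAN's bounds are meant to quantify.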