🤖 AI Summary
This work addresses the significant slowdown of GPU architecture simulation compared to native execution and the limitations of existing sampling techniques that rely on handcrafted features, which struggle to balance accuracy and efficiency. To overcome these challenges, the study introduces graph contrastive learning for GPU workload sampling: a novel approach that constructs trace graphs capturing instruction sequences and data dependencies, and employs a relational graph convolutional network to automatically uncover high-dimensional semantic and structural similarities among kernels. This method transcends the representational constraints of traditional handcrafted features. Extensive benchmark evaluations demonstrate that the proposed technique achieves an average speedup of 258.94× with only 0.37% error, substantially outperforming state-of-the-art methods such as PKA, Sieve, and STEM+ROOT.
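To make the contrastive-learning idea concrete, the sketch below shows a standard InfoNCE-style objective of the kind typically used in graph contrastive learning: two augmented views of the same kernel's trace graph form a positive pair, and all other kernels in the batch act as negatives. This is a minimal NumPy illustration of the general technique, not the paper's implementation; the function name, temperature, and data are all illustrative.

```python
import numpy as np

def info_nce(z1, z2, tau=0.1):
    """InfoNCE contrastive loss: row i of z1 and row i of z2 are two
    views (embeddings) of the same kernel, every other row in the
    batch is a negative. Lower loss = positive pairs more aligned."""
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    sim = z1 @ z2.T / tau                     # pairwise cosine / temperature
    sim -= sim.max(axis=1, keepdims=True)     # numerical stability
    logp = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(logp))            # push positives to the top

rng = np.random.default_rng(1)
a = rng.normal(size=(8, 16))                  # 8 kernels, 16-dim embeddings
loss_rand = info_nce(a, rng.normal(size=(8, 16)))        # unrelated views
loss_same = info_nce(a, a + 0.01 * rng.normal(size=(8, 16)))  # aligned views
print(loss_same < loss_rand)
```

Training the graph encoder to minimize this loss is what lets similar kernels cluster in embedding space, so that one representative per cluster can be simulated in place of the full workload.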
📝 Abstract
GPU architectural simulation is orders of magnitude slower than native execution, necessitating workload sampling for practical speedups. Existing methods rely on hand-crafted features with limited expressiveness, yielding either aggressive sampling with high errors or conservative sampling with constrained speedups. To address these issues, we propose GCL-Sampler, a sampling framework that leverages Relational Graph Convolutional Networks with contrastive learning to automatically discover high-dimensional kernel similarities from trace graphs. By encoding instruction sequences and data dependencies into graph embeddings, GCL-Sampler captures rich structural and semantic properties of program execution, enabling both high fidelity and substantial speedup. Evaluations on extensive benchmarks show that GCL-Sampler achieves a 258.94x average speedup over full-workload simulation with 0.37% error, outperforming the state-of-the-art methods PKA (129.23x, 20.90%), Sieve (94.90x, 4.10%), and STEM+ROOT (56.57x, 0.38%).
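The core encoding step can be sketched as one relational graph convolution layer over a trace graph whose nodes are instructions and whose edge types distinguish sequential order from data dependencies. This is a toy NumPy sketch of the standard R-GCN propagation rule under those assumptions; the layer sizes, normalization, and random trace graph are illustrative and not taken from the paper.

```python
import numpy as np

def rgcn_layer(h, adj_by_rel, w_by_rel, w_self):
    """One relational GCN layer: per edge type (relation), aggregate
    neighbor features, apply a relation-specific weight, add a
    self-loop transform, then a ReLU nonlinearity."""
    out = h @ w_self
    for a, w in zip(adj_by_rel, w_by_rel):
        deg = np.maximum(a.sum(axis=1, keepdims=True), 1.0)  # avoid /0
        out += (a @ h / deg) @ w       # mean over that relation's neighbors
    return np.maximum(out, 0.0)        # ReLU

rng = np.random.default_rng(0)
n, d_in, d_out = 5, 8, 4               # 5 trace-graph nodes (instructions)
h = rng.normal(size=(n, d_in))         # initial node features
# Two relations: instruction order and data dependencies (illustrative).
seq = np.eye(n, k=1)                   # edge i -> i+1: sequential order
dep = (rng.random((n, n)) < 0.3) * 1.0 # random data-dependence edges
ws = [rng.normal(size=(d_in, d_out)) for _ in range(2)]
w0 = rng.normal(size=(d_in, d_out))
z = rgcn_layer(h, [seq, dep], ws, w0)  # per-node embeddings, shape (5, 4)
emb = z.mean(axis=0)                   # mean-pooled graph-level embedding
print(z.shape, emb.shape)
```

In a full pipeline, the pooled graph embedding would feed the contrastive objective during training, and kernels would then be clustered by embedding similarity to pick representatives for simulation.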