🤖 AI Summary
High memory consumption and GPU–host data transfer bottlenecks severely hinder the computation of many-body correlation functions in lattice quantum chromodynamics (LQCD).
Method: This work proposes a scheduling optimization framework tailored for binary batched tensor contractions. It integrates two novel algorithms that jointly exploit contraction order and locality within contraction trees to improve the temporal locality of input and intermediate tensors, maximize memory reuse, and enable fine-grained dataflow control. Both algorithms are implemented within the Redstar analysis framework.
Contribution/Results: Experiments demonstrate up to a 2.1× reduction in peak memory usage, up to a 4.2× decrease in cache evictions, up to a 1.8× reduction in GPU–host data transfers, and up to a 1.9× speedup in end-to-end computation. The proposed scheduling paradigm provides a scalable, GPU-accelerated solution for large-scale evaluation of high-order LQCD correlation functions.
📝 Abstract
Computation of correlation functions is a key operation in lattice quantum chromodynamics (LQCD) simulations to extract nuclear physics observables. These functions involve many binary batch tensor contractions, with each tensor possibly occupying hundreds of MB of memory. Performing these contractions on GPU accelerators poses the challenge of scheduling them so as to optimize tensor reuse and reduce data traffic. In this work we propose two fast, novel scheduling algorithms that reorder contractions to increase temporal locality via input/intermediate tensor reuse. Our schedulers take advantage of application-specific features, such as contractions being binary and locality within contraction trees, to optimize the objective of minimizing peak memory. We integrate them into the LQCD analysis software suite Redstar and improve time-to-solution. Our schedulers attain up to a 2.1x improvement in peak memory, which is reflected in reductions of up to 4.2x in evictions and up to 1.8x in data traffic, resulting in up to 1.9x faster correlation function computation.
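
To make the scheduling objective concrete, the sketch below shows how the order of binary contractions can change peak memory. It is not the paper's algorithm or Redstar code. It assumes a simplified cost model in which a tensor becomes resident at its first use and is freed right after its last use, and it finds the best order by brute force, which is only feasible for tiny inputs. The names `peak_memory`, `is_valid`, and `best_schedule`, the tensor labels, and the 200 MB sizes are all hypothetical.

```python
from itertools import permutations

# A contraction is (left_input, right_input, output); all names are illustrative.

def peak_memory(schedule, sizes):
    """Peak resident size when running `schedule` in order.

    Assumed model: a tensor is allocated/loaded at first use and freed
    immediately after its last use. Outputs that are never consumed are
    treated as final results and freed right after they are produced.
    """
    last_use = {}
    for step, (a, b, _) in enumerate(schedule):
        last_use[a] = step
        last_use[b] = step

    resident = set()
    current = peak = 0
    for step, (a, b, out) in enumerate(schedule):
        touched = {a, b, out}
        for t in touched - resident:      # bring in anything not yet resident
            resident.add(t)
            current += sizes[t]
        peak = max(peak, current)
        for t in touched:                 # free tensors with no later use
            if last_use.get(t, step) == step:
                resident.discard(t)
                current -= sizes[t]
    return peak

def is_valid(schedule):
    """True if every intermediate is produced before it is consumed."""
    intermediates = {out for _, _, out in schedule}
    done = set()
    for a, b, out in schedule:
        if any(t in intermediates and t not in done for t in (a, b)):
            return False
        done.add(out)
    return True

def best_schedule(contractions, sizes):
    """Brute-force search over valid orders (illustration only)."""
    valid = (list(p) for p in permutations(contractions) if is_valid(p))
    return min(valid, key=lambda s: peak_memory(s, sizes))

# Two independent contraction trees: R1 = (A*B)*C and R2 = (D*E)*F.
sizes = {name: 200 for name in ["A", "B", "C", "D", "E", "F", "X", "Y", "R1", "R2"]}
breadth_first = [("A", "B", "X"), ("D", "E", "Y"), ("X", "C", "R1"), ("Y", "F", "R2")]
depth_first = [("A", "B", "X"), ("X", "C", "R1"), ("D", "E", "Y"), ("Y", "F", "R2")]

print(peak_memory(breadth_first, sizes))  # 800: X stays resident while Y is built
print(peak_memory(depth_first, sizes))    # 600: each tree is finished before the next
```

Finishing one contraction tree before starting the next keeps fewer intermediates alive at once, which is the kind of tree-level locality the abstract refers to. A practical scheduler would also need to account for inputs shared across trees and could not rely on exhaustive search.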