Fast Algorithms for Scheduling Many-body Correlation Functions on Accelerators

📅 2025-11-04

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

High memory consumption and GPU–host data transfer bottlenecks severely hinder the computation of many-body correlation functions in lattice quantum chromodynamics (LQCD). Method: This work proposes a scheduling framework for binary batched tensor contractions. It integrates two novel algorithms that jointly exploit contraction order and locality within contraction trees to increase the temporal locality of input and intermediate tensors, maximize memory reuse, and enable fine-grained dataflow control. Both are implemented within the Redstar analysis software suite. Contribution/Results: Experiments show up to a 2.1× reduction in peak memory usage, up to 4.2× fewer evictions, up to a 1.8× reduction in data traffic, and up to a 1.9× speedup in correlation function computation. The proposed scheduling approach offers a scalable, GPU-accelerated way to evaluate high-order LQCD correlation functions at scale.

📝 Abstract

Computation of correlation functions is a key operation in lattice quantum chromodynamics (LQCD) simulations to extract nuclear physics observables. These functions involve many binary batch tensor contractions, with each tensor possibly occupying hundreds of MBs of memory. Performing these contractions on GPU accelerators poses the challenge of scheduling them so as to optimize tensor reuse and reduce data traffic. In this work we propose two fast, novel scheduling algorithms that reorder contractions to increase temporal locality via input/intermediate tensor reuse. Our schedulers take advantage of application-specific features, such as contractions being binary and locality within contraction trees, to minimize peak memory. We integrate them into the LQCD analysis software suite Redstar and improve time-to-solution. Our schedulers attain up to a 2.1x improvement in peak memory, which is reflected in up to 4.2x fewer evictions and up to 1.8x less data traffic, resulting in up to 1.9x faster correlation function computation.
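
To make the peak-memory objective concrete, the following is a minimal sketch, not the paper's algorithm or Redstar's implementation. It assumes a simplified model in which every tensor has a fixed size, an input is loaded at its first use, a tensor is freed right after its last reader, and a final result is copied out immediately after it is produced. The function name `peak_memory`, the tensor names, and the unit sizes are all hypothetical.

```python
from typing import Dict, List, Tuple

# A binary contraction: (output tensor, left input, right input).
Contraction = Tuple[str, str, str]


def peak_memory(schedule: List[Contraction], size: Dict[str, int]) -> int:
    """Peak total size of live tensors under the simplified model above."""
    # Index of the last contraction that reads each tensor.
    last_use: Dict[str, int] = {}
    for step, (_, left, right) in enumerate(schedule):
        last_use[left] = step
        last_use[right] = step

    live = set()
    peak = 0
    for step, (out, left, right) in enumerate(schedule):
        # Both inputs and the output must be resident during the contraction.
        live.update((left, right, out))
        peak = max(peak, sum(size[t] for t in live))
        # Assumption: free every tensor that has no later reader
        # (final results are treated as copied out right away).
        live = {t for t in live if last_use.get(t, -1) > step}
    return peak


# Two hypothetical contraction trees: R1 = (A.B).C and R2 = (D.E).F.
size = {name: 1 for name in ["A", "B", "C", "D", "E", "F", "T1", "T2", "R1", "R2"]}

# Interleaved order: both intermediates T1 and T2 are alive at the same time.
interleaved = [("T1", "A", "B"), ("T2", "D", "E"), ("R1", "T1", "C"), ("R2", "T2", "F")]

# Tree-local order: each tree is finished before the next one starts.
tree_local = [("T1", "A", "B"), ("R1", "T1", "C"), ("T2", "D", "E"), ("R2", "T2", "F")]

print(peak_memory(interleaved, size), peak_memory(tree_local, size))
```

Both schedules perform the same four contractions. In this toy model, only the order decides how many intermediates are resident at once, and that order is the quantity the schedulers described above aim to improve.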

Problem

Research questions and friction points this paper is trying to address.

Optimizing tensor reuse in many-body correlation function computations

Reducing data traffic for binary batch tensor contractions on GPUs

Minimizing peak memory usage when scheduling LQCD tensor contractions

Innovation

Methods, ideas, or system contributions that make the work stand out.

Novel scheduling algorithms reorder contractions for temporal locality

Schedulers minimize peak memory by exploiting binary contractions and locality within contraction trees

Integration into LQCD software reduces data traffic and evictions
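
One simple way to illustrate locality within contraction trees is a depth-first post-order traversal that finishes each tree before starting the next and computes a shared intermediate only once. This is a generic sketch under stated assumptions, not either of the two schedulers proposed in the paper. The function name `postorder_schedule`, the `produces` mapping, and the tensor names are hypothetical.

```python
from typing import Dict, List, Set, Tuple

# A binary contraction: (output tensor, left input, right input).
Contraction = Tuple[str, str, str]


def postorder_schedule(
    produces: Dict[str, Tuple[str, str]], roots: List[str]
) -> List[Contraction]:
    """Order contractions tree by tree, children before parents."""
    schedule: List[Contraction] = []
    done: Set[str] = set()

    def visit(tensor: str) -> None:
        # Input tensors have no producing contraction; intermediates that
        # were already scheduled are reused rather than recomputed.
        if tensor not in produces or tensor in done:
            return
        left, right = produces[tensor]
        visit(left)
        visit(right)
        schedule.append((tensor, left, right))
        done.add(tensor)

    for root in roots:
        visit(root)
    return schedule


# Hypothetical example: R1 and R2 both consume the intermediate T1.
produces = {"T1": ("A", "B"), "R1": ("T1", "C"), "R2": ("T1", "D")}
order = postorder_schedule(produces, ["R1", "R2"])
print(order)
```

Visiting roots that share intermediates back to back keeps the shared tensor's lifetime short. A real scheduler would also have to decide the order of the roots and of each node's children, which this sketch leaves as given.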
