🤖 AI Summary
Traditional GPU-accelerated volumetric data structures overemphasize data locality while neglecting critical factors such as thread occupancy, inter-GPU communication overhead, and kernel fusion. To address this, we propose a decoupled volumetric architecture that shifts the design paradigm from “locality-first” to multi-objective co-optimization of memory access patterns, thread occupancy, communication scheduling, and kernel fusion. Our architecture natively supports dense, block-sparse, and multi-resolution representations, enabling dynamic adaptation to complex geometries and heterogeneous data distributions. By integrating customized sharding-based communication strategies and automated fused-kernel generation, we achieve up to 3× speedup on single-node multi-GPU platforms for Lattice Boltzmann Method (LBM) fluid simulations. Experimental results demonstrate significant reductions in register pressure and communication overhead, alongside improved kernel fusion efficiency—validating both robustness and generalizability across diverse volumetric workloads.
📝 Abstract
Volumetric data structures are traditionally optimized for data locality, with a primary focus on efficient memory access patterns in computational tasks. However, prioritizing data locality alone can overlook other critical factors necessary for optimal performance, e.g., occupancy, communication, and kernel fusion. We propose a novel disaggregated design approach that rebalances the trade-offs between data locality and these essential objectives. This includes reducing communication overhead in distributed memory architectures, mitigating the impact of register pressure in complex boundary conditions for fluid simulation, and increasing opportunities for kernel fusion. We present a comprehensive analysis of the benefits of our disaggregated design, applied to a fluid solver based on the Lattice Boltzmann Method (LBM) and deployed on a single-node multi-GPU system. Our evaluation spans various discretizations, ranging from dense to block-sparse and multi-resolution representations, highlighting the flexibility and efficiency of the disaggregated design across diverse use cases. Leveraging the disaggregated design, we showcase how we target different optimization objectives that result in up to a $3\times$ speedup compared to state-of-the-art solutions.
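To make the kernel-fusion opportunity mentioned above concrete, here is a minimal sketch (not the paper's code) of a D1Q3 lattice Boltzmann step written two ways: as separate "collide" and "stream" passes, and as a single fused pass that writes post-collision values directly to their streamed destinations. All names (`TAU`, `W`, `E`, the D1Q3 lattice) are standard LBM notation chosen for illustration, not identifiers from the paper.

```python
W = (1/6, 2/3, 1/6)   # D1Q3 lattice weights
E = (-1, 0, 1)        # D1Q3 lattice velocities
TAU = 0.8             # BGK relaxation time

def equilibrium(rho, u):
    # Second-order BGK equilibrium with c_s^2 = 1/3.
    return [w * rho * (1 + 3*e*u + 4.5*(e*u)**2 - 1.5*u*u)
            for w, e in zip(W, E)]

def collide(f):
    # Pass 1: relax each cell's distributions toward local equilibrium.
    out = []
    for cell in f:
        rho = sum(cell)
        u = sum(e * fi for e, fi in zip(E, cell)) / rho
        feq = equilibrium(rho, u)
        out.append([fi + (fe - fi) / TAU for fi, fe in zip(cell, feq)])
    return out

def stream(f):
    # Pass 2: periodic streaming; distribution i at cell x moves to x + E[i].
    n = len(f)
    out = [[0.0] * 3 for _ in range(n)]
    for x in range(n):
        for i, e in enumerate(E):
            out[(x + e) % n][i] = f[x][i]
    return out

def step_two_pass(f):
    # Unfused baseline: two sweeps over the grid, with an intermediate array.
    return stream(collide(f))

def step_fused(f):
    # Fused collide+stream ("push" scheme): one sweep, post-collision values
    # are written straight to their streamed destinations, eliminating the
    # intermediate read/write of the full distribution field.
    n = len(f)
    out = [[0.0] * 3 for _ in range(n)]
    for x, cell in enumerate(f):
        rho = sum(cell)
        u = sum(e * fi for e, fi in zip(E, cell)) / rho
        feq = equilibrium(rho, u)
        for i, e in enumerate(E):
            out[(x + e) % n][i] = cell[i] + (feq[i] - cell[i]) / TAU
    return out
```

Both variants produce identical results; the fused one roughly halves memory traffic per time step. On GPUs, whether such fusion pays off also depends on register pressure and occupancy, which is exactly the kind of trade-off a locality-only design leaves unaddressed.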