🤖 AI Summary
To address the GPU memory bandwidth bottleneck that limits deep learning performance, this work pioneers compiler-level operator fusion leveraging NVIDIA H100’s Distributed Shared Memory (DSM) hardware—bypassing traditional constraints of on-chip registers and SM-local shared memory to enable cross-SM fusion of operators with large intermediate tensors (e.g., FFN). We propose a DSM communication abstraction model, a DSM-aware dataflow analyzer, and a unified search framework supporting cost modeling, schedule optimization, and search-space pruning. Evaluated on the H100, our approach reduces memory traffic by 58%, achieves up to 3.3× kernel speedup over highly optimized libraries and 4.1× over state-of-the-art compilers, and delivers 1.24× end-to-end acceleration for representative transformer workloads.
📝 Abstract
The scaling of computation throughput continues to outpace improvements in memory bandwidth, leaving many deep learning workloads memory-bound. Kernel fusion is a key technique for alleviating this problem, but the fusion strategies of existing compilers and frameworks are limited to local scratchpad memory: when intermediate results exceed its limited capacity, as in FFN layers, fusion fails. Although modern GPUs such as the NVIDIA H100 now incorporate an inter-core connection mechanism known as Distributed Shared Memory (DSM), which provides a larger, high-bandwidth, low-latency on-chip memory pool, this hardware potential has yet to be exploited by software frameworks. To bridge this gap, we present FlashFuser, the first compiler framework to exploit inter-core connections for kernel fusion on modern GPUs. FlashFuser extends established fusion techniques to the DSM domain through three core contributions. First, we propose a powerful DSM-based communication abstraction that formalizes complex cluster-based data-exchange patterns such as reduce, shuffle, and multiply. Second, we introduce a dataflow analyzer that generalizes loop scheduling, resource mapping, and tile selection to the distributed memory hierarchy; it determines the optimal execution order and tile sizes by quantifying data movement across memory levels. Finally, FlashFuser integrates these components into a unified search engine that employs analytical cost modeling and DSM-aware pruning strategies to efficiently discover the optimal execution plan. Our evaluation on an NVIDIA H100 GPU shows that FlashFuser reduces memory access by 58% and delivers kernel speedups of 3.3x over highly tuned libraries and 4.1x over state-of-the-art compilers, yielding a 1.24x end-to-end speedup.
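The DSM mechanism the abstract builds on is exposed on Hopper-class GPUs through CUDA's thread block cluster API: a block in a cluster can read a peer block's shared memory directly over the SM-to-SM fabric instead of round-tripping through global memory. The sketch below is a minimal illustration of that hardware primitive only, not FlashFuser's code; the kernel name, cluster size, and tile size are hypothetical, and it requires an sm_90 (H100) GPU and CUDA 12+.

```cuda
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

// Hypothetical kernel: a cluster of 2 thread blocks, each holding a
// shared-memory tile. Block 0 combines its own tile with block 1's tile
// by reading the peer's shared memory directly (the DSM access pattern
// that cross-SM fusion relies on).
__global__ void __cluster_dims__(2, 1, 1) dsm_pair_sum(float *out) {
    __shared__ float tile[256];
    cg::cluster_group cluster = cg::this_cluster();

    tile[threadIdx.x] = (float)threadIdx.x;  // each block fills its SM-local tile
    cluster.sync();  // make every block's shared memory visible cluster-wide

    if (cluster.block_rank() == 0) {
        // Map block 1's copy of `tile` into this block's address space.
        float *peer = cluster.map_shared_rank(tile, 1);
        out[threadIdx.x] = tile[threadIdx.x] + peer[threadIdx.x];
    }
    cluster.sync();  // keep peer shared memory alive until all reads finish
}
```

Launching such a kernel with, e.g., `dsm_pair_sum<<<2, 256>>>(out)` exchanges the intermediate tile entirely on-chip; without DSM, the same exchange would have to spill through global memory, which is the traffic that cluster-based fusion eliminates.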