BLEST: Blazingly Efficient BFS using Tensor Cores

📅 2025-12-26

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

To accelerate irregular breadth-first search (BFS) on GPU tensor cores (TCs), this paper proposes BLEST, a TC-accelerated pull-based BFS framework. It makes three main contributions: (1) Binarised Virtual Slice Sets (BVSS) for warp-level load balancing; (2) two complementary graph reorderings, a compression-oriented one for social-like graphs and a bandwidth-reducing one for non-social graphs; and (3) a batched SpMSpV pattern on bitwise TC tiles that avoids wasting output entries, combined with kernel fusion and lazy vertex updates to reduce host-side synchronization and atomic overhead. Across a broad set of real-world graphs, BLEST achieves average speedups of 3.58×, 4.64×, and 4.9× over BerryBees, Gunrock, and GSWITCH, respectively.

📝 Abstract

Breadth-First Search (BFS) is a fundamental graph kernel that underpins a wide range of applications. While modern GPUs provide specialised Matrix-Multiply-Accumulate (MMA) units, e.g., Tensor Cores (TC), with extremely high throughput, they target dense operations, making it non-trivial to exploit them for irregular, unstructured graph computations. In particular, fully utilising them for a BFS requires an efficient mapping of the edge operations onto TCs while avoiding redundancy, load imbalance, and synchronisation. We present BLEST, a TC-accelerated framework that reformulates the pull-based BFS pipeline around a bitmap-oriented structure and a carefully engineered execution layout. BLEST introduces Binarised Virtual Slice Sets (BVSS) to enforce warp-level load balancing and to eliminate frontier-oblivious work assignment. To improve both memory efficiency and update locality across diverse graphs, we apply two complementary graph reordering strategies: a compression-oriented ordering for social-like graphs and a bandwidth-reducing ordering for non-social graphs. At the compute level, we develop a batched SpMSpV multiplication pattern that uses the bitwise TC tiles to handle dot products without wasting output entries, thereby reducing the number of required MMA calls. Finally, BLEST combines kernel fusion with a lazy vertex update scheme to reduce host-side synchronisation, mitigate atomic overheads, and improve cache locality. Experiments show that BLEST delivers, on average, $3.58\times$, $4.64\times$ and $4.9\times$ speedup over BerryBees, Gunrock, and GSWITCH, respectively, across a broad set of real-world graphs.
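To make the "pull-based BFS around a bitmap-oriented structure" idea concrete, here is a minimal CPU sketch in Python. It is not BLEST's implementation: the function names, the use of Python integers as bitmaps, and the example graph are all assumptions made for illustration. Each unvisited vertex "pulls" by ANDing its in-neighbour bitmap with the frontier bitmap, which is a Boolean sparse-matrix times sparse-vector (SpMSpV) product computed one row at a time.

```python
def build_in_neighbor_masks(num_vertices, edges):
    """Return one bitmask per vertex v; bit u is set if edge u -> v exists.

    Hypothetical helper for this sketch, not part of BLEST.
    """
    masks = [0] * num_vertices
    for u, v in edges:
        masks[v] |= 1 << u
    return masks


def pull_bfs_bitmap(in_neighbors, source):
    """Pull-based BFS over bitmaps; returns the BFS level of every vertex.

    Unreachable vertices keep level -1.
    """
    n = len(in_neighbors)
    levels = [-1] * n
    levels[source] = 0
    frontier = 1 << source   # bitmap of vertices discovered in the last step
    visited = frontier       # bitmap of all vertices discovered so far
    depth = 0
    while frontier:
        depth += 1
        next_frontier = 0
        for v in range(n):
            if (visited >> v) & 1:
                continue  # already discovered, nothing to pull
            # Boolean dot product: does v have any in-neighbour in the frontier?
            if in_neighbors[v] & frontier:
                next_frontier |= 1 << v
                levels[v] = depth
        visited |= next_frontier
        frontier = next_frontier
    return levels


# Small directed example graph (assumed for illustration); vertex 5 is isolated.
example_edges = [(0, 1), (0, 2), (1, 3), (2, 3), (3, 4)]
example_masks = build_in_neighbor_masks(6, example_edges)
example_levels = pull_bfs_bitmap(example_masks, source=0)
```

On a GPU, the per-vertex AND test is the part that a bitwise MMA tile would evaluate for many rows at once; the sequential loop here only shows the data flow.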

Problem

Research questions and friction points this paper is trying to address.

Efficiently mapping BFS onto GPU Tensor Cores

Balancing load and reducing redundancy in graph computations

Optimizing memory and compute for diverse graph types
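The abstract mentions a bandwidth-reducing ordering for non-social graphs. The sketch below only illustrates the quantity such an ordering tries to shrink: the bandwidth of a graph under a vertex ordering, i.e. the largest distance between the positions of any edge's two endpoints. The function name and the example are assumptions, and the source does not say which reordering algorithm BLEST uses.

```python
def bandwidth(edges, order):
    """Largest |position(u) - position(v)| over all edges under `order`.

    `order` lists the vertices in their new storage order.
    """
    position = {vertex: index for index, vertex in enumerate(order)}
    return max((abs(position[u] - position[v]) for u, v in edges), default=0)


# A 5-vertex path graph (assumed example): 0 - 1 - 2 - 3 - 4.
path_edges = [(0, 1), (1, 2), (2, 3), (3, 4)]

natural_bw = bandwidth(path_edges, [0, 1, 2, 3, 4])    # neighbours stored adjacently
scrambled_bw = bandwidth(path_edges, [0, 2, 4, 1, 3])  # neighbours spread apart
```

A lower bandwidth means the neighbours of a vertex sit close together in memory, so the bits touched while processing a block of rows fall in a narrower range. That is one plausible reason such an ordering would help update locality.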

Innovation

Methods, ideas, or system contributions that make the work stand out.

Bitmap-oriented BFS pipeline with Tensor Cores

Binarised Virtual Slice Sets for load balancing

Batched SpMSpV multiplication using bitwise TC tiles
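The point about not wasting TC output entries can be illustrated with a small Python model of a binary MMA tile. This is a sketch under stated assumptions, not the paper's layout: the 8×8×128 tile shape, the AND-plus-popcount semantics, and the helper names are chosen for illustration. When a single frontier vector occupies one column of the tile's second operand, only 8 of the 64 outputs carry information, so a layout that fills every output would need far fewer MMA calls.

```python
# Assumed tile shape for this sketch: M x N outputs, each a K-bit dot product.
TILE_M, TILE_N, TILE_K = 8, 8, 128


def popcount(x):
    """Number of set bits in a non-negative integer."""
    return bin(x).count("1")


def binary_mma_tile(a_rows, b_cols):
    """Model of one bitwise tile: out[i][j] = popcount(a_rows[i] & b_cols[j])."""
    assert len(a_rows) == TILE_M and len(b_cols) == TILE_N
    return [[popcount(a & b) for b in b_cols] for a in a_rows]


def mma_calls(num_dot_products, useful_outputs_per_tile):
    """Tiles needed when each tile yields this many useful dot products."""
    return -(-num_dot_products // useful_outputs_per_tile)  # ceiling division


# Naive layout: the frontier fills column 0, the other 7 columns are padding.
frontier = 0b1010
a_rows = [0b0010, 0b0101, 0b1000, 0b0000, 0b1111, 0b0001, 0b0110, 0b0100]
b_cols = [frontier] + [0] * (TILE_N - 1)
tile = binary_mma_tile(a_rows, b_cols)
hits = [row[0] > 0 for row in tile]  # which rows have a neighbour in the frontier

naive_calls = mma_calls(1024, TILE_M)            # 8 useful outputs per tile
packed_calls = mma_calls(1024, TILE_M * TILE_N)  # all 64 outputs useful
```

Under these assumptions, 1024 dot products take 128 tiles in the naive layout and 16 when every output entry is used, which is the kind of reduction in MMA calls the abstract describes.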
