Performance-Driven Optimization of Parallel Breadth-First Search

📅 2025-03-01
🤖 AI Summary
Parallel breadth-first search (BFS) on multicore systems suffers from irregular memory access patterns, load imbalance, and synchronization overhead. This paper proposes a hardware- and graph-aware parallel BFS method that optimizes jointly for architectural features and graph structure. It introduces a novel non-atomic distance update mechanism, combined with a hybrid traversal strategy and a compact bitmap-based visited set, and evaluates them on two CPU platforms (Intel Xeon and AMD EPYC) across graphs of differing diameter. On small-diameter graphs, the hybrid implementation achieves speedups of 3–8× on the Intel platform and 3–10× on the AMD system over a conventional parallel BFS. The paper also characterizes how strongly the benefit of each optimization depends on graph structure and hardware, and reports that on large-diameter graphs some optimizations behave differently across platforms and can degrade performance. The work argues for designing graph algorithms together with the underlying hardware and the input graph topology.

📝 Abstract
Breadth-first search (BFS) is a fundamental graph algorithm that is difficult to parallelize efficiently because of irregular memory access patterns, load imbalance, and synchronization overhead. In this paper, we introduce a set of optimization strategies for parallel BFS on multicore systems: hybrid traversal, a bitmap-based visited set, and a novel non-atomic distance update mechanism. We evaluate these optimizations on two architectures, a 24-core Intel Xeon platform and a 128-core AMD EPYC system, using a diverse set of synthetic and real-world graphs. Our results demonstrate that the effectiveness of the optimizations varies significantly with graph characteristics and hardware architecture. For small-diameter graphs, our hybrid BFS implementation achieves speedups of 3–8× on the Intel platform and 3–10× on the AMD system compared to a conventional parallel BFS implementation. For large-diameter graphs the picture is more nuanced: some optimizations behave differently across platforms, and in some cases they degrade performance.
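
The abstract names hybrid traversal without giving details, so the following is only a minimal sequential sketch of one common form of the idea: switching between top-down expansion (scan the frontier's neighbors) and bottom-up expansion (let each unvisited vertex look for a parent in the frontier). The function name `hybrid_bfs`, the tuning parameter `alpha`, and the switching rule are assumptions made for illustration and are not taken from the paper. The sketch assumes an undirected graph stored as adjacency lists.

```python
def hybrid_bfs(adj, source, alpha=4):
    """Return BFS distances from `source`; unreachable vertices get -1.

    `alpha` is a hypothetical tuning knob: a level runs bottom-up when
    (edges leaving the frontier) * alpha exceeds the number of edges
    incident to still-unvisited vertices.
    """
    n = len(adj)
    dist = [-1] * n
    dist[source] = 0
    frontier = [source]
    # Edge endpoints that belong to vertices not yet visited.
    unvisited_edges = sum(len(nbrs) for nbrs in adj) - len(adj[source])
    level = 0

    while frontier:
        frontier_edges = sum(len(adj[v]) for v in frontier)
        next_frontier = []

        if frontier_edges * alpha > unvisited_edges:
            # Bottom-up step: each unvisited vertex stops at the first
            # neighbor it finds in the current frontier.
            in_frontier = [False] * n
            for v in frontier:
                in_frontier[v] = True
            for v in range(n):
                if dist[v] == -1:
                    for u in adj[v]:
                        if in_frontier[u]:
                            dist[v] = level + 1
                            next_frontier.append(v)
                            break
        else:
            # Top-down step: expand every edge leaving the frontier.
            for u in frontier:
                for v in adj[u]:
                    if dist[v] == -1:
                        dist[v] = level + 1
                        next_frontier.append(v)

        unvisited_edges -= sum(len(adj[v]) for v in next_frontier)
        frontier = next_frontier
        level += 1

    return dist
```

Bottom-up steps tend to pay off when the frontier covers a large share of the graph, which typically happens in the middle levels of a small-diameter graph. On large-diameter graphs the frontier stays small, so a rule like this one would rarely switch direction.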

Problem

Research questions and friction points this paper is trying to address.

Optimizing parallel BFS for multicore systems
Addressing irregular memory access and load imbalance
Evaluating optimizations across different graph types and architectures

Innovation

Methods, ideas, or system contributions that make the work stand out.

Hybrid traversal for parallel BFS optimization
Bitmap-based visited set for memory efficiency
Non-atomic distance update mechanism to reduce synchronization overhead
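
To make the memory argument for a bitmap-based visited set concrete, here is a minimal sequential sketch that stores one bit per vertex in a `bytearray`, so a graph with n vertices needs about n/8 bytes of visited state. The names `Bitmap` and `bfs_with_bitmap` are hypothetical and the layout is an assumption; the paper's actual data structure is not described in this summary. The sketch is single-threaded, so it does not attempt to show the non-atomic distance update.

```python
from collections import deque


class Bitmap:
    """Fixed-size bit set: one bit per vertex, packed 8 per byte."""

    def __init__(self, n):
        self.bits = bytearray((n + 7) // 8)

    def test(self, i):
        # Byte index is i // 8, bit position within the byte is i % 8.
        return ((self.bits[i >> 3] >> (i & 7)) & 1) == 1

    def set(self, i):
        self.bits[i >> 3] |= 1 << (i & 7)


def bfs_with_bitmap(adj, source):
    """Top-down BFS that tracks visited vertices in a Bitmap.

    Returns a list of distances; unreachable vertices get -1.
    """
    n = len(adj)
    visited = Bitmap(n)
    dist = [-1] * n
    visited.set(source)
    dist[source] = 0
    queue = deque([source])

    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if not visited.test(v):
                visited.set(v)
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist
```

For example, `bfs_with_bitmap([[1], [0, 2], [1]], 0)` walks a three-vertex path using a single byte of visited state. In a multithreaded version, two threads setting different bits in the same byte would race on that byte, which is one reason the choice of update mechanism matters.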