Performance-Driven Optimization of Parallel Breadth-First Search

📅 2025-03-01

📈 Citations: 0

✨ Influential: 0


🤖 AI Summary

Parallel breadth-first search (BFS) on multicore systems suffers from irregular memory access patterns, load imbalance, and synchronization overhead. This paper presents a set of optimizations for parallel BFS: a hybrid traversal strategy, a compact bitmap-based visited set, and a novel non-atomic distance update mechanism. The optimizations are evaluated on a 24-core Intel Xeon platform and a 128-core AMD EPYC system using synthetic and real-world graphs. On small-diameter graphs, the hybrid implementation runs 3–8× faster on the Intel platform and 3–10× faster on the AMD system than a conventional parallel BFS. The results also show that the benefit of each optimization depends strongly on graph structure and hardware: on large-diameter graphs, some optimizations behave differently across platforms and can degrade performance.


📝 Abstract

Breadth-first search (BFS) is a fundamental graph algorithm that is hard to parallelize efficiently because of irregular memory access patterns, load imbalance, and synchronization overhead. In this paper, we introduce a set of optimization strategies for parallel BFS on multicore systems, including hybrid traversal, a bitmap-based visited set, and a novel non-atomic distance update mechanism. We evaluate these optimizations on two architectures, a 24-core Intel Xeon platform and a 128-core AMD EPYC system, using a diverse set of synthetic and real-world graphs. Our results demonstrate that the effectiveness of the optimizations varies significantly with graph characteristics and hardware architecture. For small-diameter graphs, our hybrid BFS implementation achieves speedups of 3–8× on the Intel platform and 3–10× on the AMD system compared to a conventional parallel BFS implementation. Results on large-diameter graphs are more nuanced: some optimizations perform differently across platforms and degrade performance in some cases.
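The abstract names hybrid traversal without giving its switching rule. As a rough illustration only, the sketch below shows one common form of the idea: a level-synchronous BFS that expands small frontiers top-down and large frontiers bottom-up. It is sequential for clarity, assumes an undirected graph stored as adjacency lists, and uses a hypothetical threshold `alpha` that is not taken from the paper.

```python
from typing import List


def hybrid_bfs(adj: List[List[int]], source: int, alpha: float = 0.05) -> List[int]:
    """Level-synchronous BFS that picks a direction per level.

    adj    -- adjacency lists of an undirected graph (assumption)
    alpha  -- hypothetical threshold: go bottom-up when the frontier
              holds more than alpha * n vertices
    Returns the BFS distance of every vertex, or -1 if unreachable.
    """
    n = len(adj)
    dist = [-1] * n
    dist[source] = 0
    frontier = [source]
    level = 0

    while frontier:
        next_frontier = []
        if len(frontier) > alpha * n:
            # Bottom-up step: every unvisited vertex searches its
            # neighbors for a parent that is in the current frontier.
            in_frontier = [False] * n
            for u in frontier:
                in_frontier[u] = True
            for v in range(n):
                if dist[v] == -1:
                    for u in adj[v]:
                        if in_frontier[u]:
                            dist[v] = level + 1
                            next_frontier.append(v)
                            break  # one parent is enough
        else:
            # Top-down step: every frontier vertex visits its neighbors.
            for u in frontier:
                for v in adj[u]:
                    if dist[v] == -1:
                        dist[v] = level + 1
                        next_frontier.append(v)
        frontier = next_frontier
        level += 1

    return dist
```

Both directions assign the same distances, so the choice of `alpha` affects only how much work each level does. Bottom-up steps tend to pay off when the frontier covers a large share of the graph, which is typical of small-diameter graphs and may be one reason the reported speedups concentrate there.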

Problem

Research questions and friction points this paper is trying to address.

Optimizing parallel BFS for multicore systems

Addressing irregular memory access and load imbalance

Evaluating optimizations across different graph types and architectures

Innovation

Methods, ideas, or system contributions that make the work stand out.

Hybrid traversal for parallel BFS optimization

Bitmap-based visited set for memory efficiency

Non-atomic distance update mechanism to reduce synchronization overhead
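This page does not describe how the paper implements the bitmap or the non-atomic update, so the following is only a sketch of the general ideas under stated assumptions. It assumes a level-synchronous BFS: within one level, every thread that reaches an unvisited vertex stores the same value (`level + 1`), so a plain store gives the correct distance without compare-and-swap. The `Bitmap` class, `parallel_bfs`, and the choice to update the bitmap in a single thread after each level are all hypothetical illustration choices.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List


class Bitmap:
    """Visited set that uses one bit per vertex."""

    def __init__(self, n: int) -> None:
        self.bits = bytearray((n + 7) // 8)

    def test(self, i: int) -> bool:
        return bool(self.bits[i >> 3] & (1 << (i & 7)))

    def set(self, i: int) -> None:
        self.bits[i >> 3] |= 1 << (i & 7)


def parallel_bfs(adj: List[List[int]], source: int, workers: int = 4) -> List[int]:
    n = len(adj)
    dist = [-1] * n
    visited = Bitmap(n)
    dist[source] = 0
    visited.set(source)
    frontier = [source]
    level = 0

    def expand(chunk: List[int]) -> List[int]:
        # Workers only read the bitmap and write dist with plain stores.
        # Two workers may reach the same vertex, but both store the same
        # value (level + 1), so the duplicate write is harmless.
        found = []
        for u in chunk:
            for v in adj[u]:
                if not visited.test(v):
                    dist[v] = level + 1
                    found.append(v)
        return found

    with ThreadPoolExecutor(max_workers=workers) as pool:
        while frontier:
            chunks = [frontier[i::workers] for i in range(workers)]
            # list() waits for all workers: this is the level barrier.
            results = list(pool.map(expand, chunks))
            next_frontier = []
            for found in results:
                for v in found:
                    # Single-threaded merge: mark visited, drop duplicates.
                    if not visited.test(v):
                        visited.set(v)
                        next_frontier.append(v)
            frontier = next_frontier
            level += 1

    return dist
```

Python threads illustrate the logic but not the speedup; a native implementation would run the same pattern with real parallel threads. The bitmap needs roughly `n / 8` bytes, so a larger share of the visited set can stay in cache than with a byte or word per vertex.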
