Optimizing Fine-Grained Parallelism Through Dynamic Load Balancing on Multi-Socket Many-Core Systems

📅 2025-02-07
🤖 AI Summary
To address severe performance bottlenecks in GNU OpenMP's fine-grained tasking on multi-socket many-core systems, which stem from runtime synchronization overhead, this paper proposes a lightweight, NUMA-aware set of parallel runtime optimizations. It introduces XQueue, a lock-less task queue that replaces GNU's priority task queue and removes the global task lock; a hybrid lock-free/lock-less distributed tree barrier that reduces synchronization latency; and two lock-less load balancing strategies, which combine local work-stealing with cross-NUMA task migration, to adapt dynamically to load imbalance. Evaluated on the BOTS benchmarks, the queue and barrier together achieve up to 1522.8× single-node speedup over the original GNU OpenMP; adding dynamic load balancing improves fine-grained task performance by up to a further 4× relative to GNU OpenMP with XQueue, with order-of-magnitude gains in specific scenarios. The core contribution is the combined use of lock-less task queues, distributed barriers, and NUMA-aware load balancing, which removes the global task lock and the centralized synchronization bottlenecks.

📝 Abstract
Achieving efficient task parallelism on many-core architectures is an important challenge. The widely used GNU OpenMP implementation of the popular OpenMP parallel programming model incurs high overhead for fine-grained, short-running tasks due to time spent on runtime synchronization. In this work, we introduce and analyze three key advances that collectively achieve significant performance gains. First, we introduce XQueue, a lock-less concurrent queue implementation to replace GNU's priority task queue and remove the global task lock. Second, we develop a scalable, efficient, and hybrid lock-free/lock-less distributed tree barrier to address the high hardware synchronization overhead from GNU's centralized barrier. Third, we develop two lock-less and NUMA-aware load balancing strategies. We evaluate our implementation using Barcelona OpenMP Task Suite (BOTS) benchmarks. Results from the first and second advances demonstrate up to 1522.8$\times$ performance improvement compared to the original GNU OpenMP. Further improvements from lock-less load balancing show up to 4$\times$ improvement compared to GNU OpenMP using XQueue. Through a rich set of profiling and instrumentation tools, we are able to investigate the runtime behavior of GNU OpenMP and improve its performance on fine-grained tasks by many orders of magnitude.
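To make the idea of a lock-less concurrent queue concrete, the following is a minimal sketch of a single-producer/single-consumer ring buffer in C++. It is not the paper's XQueue: the class name `SpscQueue`, its fixed capacity, and the restriction to one producer and one consumer are assumptions chosen for illustration. It shows only how two threads can hand off items using atomic loads and stores, with no mutex and no compare-and-swap loop.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>

// Hypothetical single-producer/single-consumer ring buffer.
// Illustrative only: this is NOT the paper's XQueue implementation.
template <typename T, std::size_t Capacity>
class SpscQueue {
    static_assert(Capacity >= 2, "one slot stays empty to tell full from empty");

public:
    // Call from the producer thread only. Returns false if the queue is full.
    bool push(const T& value) {
        const std::size_t tail = tail_.load(std::memory_order_relaxed);
        const std::size_t next = (tail + 1) % Capacity;
        if (next == head_.load(std::memory_order_acquire)) {
            return false;  // full
        }
        slots_[tail] = value;
        // Publish the slot: the consumer's acquire load of tail_ sees the write.
        tail_.store(next, std::memory_order_release);
        return true;
    }

    // Call from the consumer thread only. Returns false if the queue is empty.
    bool pop(T& out) {
        const std::size_t head = head_.load(std::memory_order_relaxed);
        if (head == tail_.load(std::memory_order_acquire)) {
            return false;  // empty
        }
        out = slots_[head];
        // Release the slot back to the producer.
        head_.store((head + 1) % Capacity, std::memory_order_release);
        return true;
    }

private:
    T slots_[Capacity];
    std::atomic<std::size_t> head_{0};  // next slot to read (consumer-owned)
    std::atomic<std::size_t> tail_{0};  // next slot to write (producer-owned)
};
```

Because each index is written by exactly one thread, neither side ever waits on a lock held by the other. A queue with `Capacity` slots holds at most `Capacity - 1` items in this design.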
Problem

Research questions and friction points this paper is trying to address.

Optimize fine-grained parallelism on multi-socket many-core systems.
Reduce synchronization overhead in GNU OpenMP for short tasks.
Develop lock-less load balancing strategies for improved performance.
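As an illustration of what "short tasks" means here, the sketch below computes Fibonacci numbers with one OpenMP task per recursive call. The function names `fib_task` and `fib_parallel` are hypothetical and the code is not taken from the paper or from BOTS. Each task does only an addition, so the time spent creating, queuing, and synchronizing tasks can easily exceed the useful work, which is the kind of overhead the paper targets.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical fine-grained workload: every call spawns two tiny child tasks.
// Without OpenMP enabled, the pragmas are ignored and the code runs serially.
std::uint64_t fib_task(int n) {
    if (n < 2) {
        return static_cast<std::uint64_t>(n);
    }
    std::uint64_t a = 0;
    std::uint64_t b = 0;
    // shared(...) lets each child write its result into the parent's variable.
    #pragma omp task shared(a)
    a = fib_task(n - 1);
    #pragma omp task shared(b)
    b = fib_task(n - 2);
    // Wait for both children before combining their results.
    #pragma omp taskwait
    return a + b;
}

// Starts a thread team and lets a single thread create the root of the task tree.
std::uint64_t fib_parallel(int n) {
    std::uint64_t result = 0;
    #pragma omp parallel
    #pragma omp single
    result = fib_task(n);
    return result;
}
```

With GCC, compiling with `-fopenmp` enables the task constructs. Timing `fib_parallel` for a moderately large `n` against a plain recursive version gives a rough feel for per-task runtime overhead.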
Innovation

Methods, ideas, or system contributions that make the work stand out.

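XQueue, a lock-less concurrent queue, replaces GNU's priority task queue and removes the global task lock.
A hybrid lock-free/lock-less distributed tree barrier replaces GNU's centralized barrier and reduces synchronization overhead.
Two NUMA-aware, lock-less load balancing strategies redistribute work across workers.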
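One plausible reading of NUMA-aware load balancing is that an idle worker looks for work on its own NUMA node before reaching across sockets. The sketch below shows that ordering only. The helper `victim_order` and the assumption that workers are numbered contiguously per node (worker `w` sits on node `w / workers_per_node`) are hypothetical, and the paper's two strategies may choose victims differently.

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper: list the workers an idle "thief" would probe, with
// workers on the thief's own NUMA node first and remote workers afterwards.
// Assumes worker w is on node (w / workers_per_node); a real runtime would
// query the machine topology instead.
std::vector<int> victim_order(int thief, int num_workers, int workers_per_node) {
    std::vector<int> local;
    std::vector<int> remote;
    const int thief_node = thief / workers_per_node;
    // Walk the other workers in ring order, starting just after the thief,
    // so that different thieves do not all probe worker 0 first.
    for (int step = 1; step < num_workers; ++step) {
        const int candidate = (thief + step) % num_workers;
        if (candidate / workers_per_node == thief_node) {
            local.push_back(candidate);
        } else {
            remote.push_back(candidate);
        }
    }
    // Same-node victims come first; cross-node victims are the fallback.
    local.insert(local.end(), remote.begin(), remote.end());
    return local;
}
```

For example, with 8 workers split across two nodes of 4, worker 1 would probe workers 2, 3, and 0 before trying workers 4 through 7 on the other node. Preferring local victims keeps stolen tasks close to the memory they are likely to touch.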