🤖 AI Summary
GNU OpenMP suffers severe performance bottlenecks for fine-grained tasks on multi-socket manycore systems because of runtime synchronization overhead. This paper proposes three NUMA-aware runtime optimizations to address them. XQueue, a lock-less concurrent task queue, replaces GNU's priority task queue and removes the global task lock. A hybrid lock-free/lock-less distributed tree barrier replaces GNU's centralized barrier to reduce synchronization latency. Two lock-less, NUMA-aware load balancing strategies, combining local work-stealing with cross-NUMA task migration, adapt dynamically to load imbalance. Evaluated on the BOTS benchmarks, the queue and barrier together achieve up to 1522.8× speedup over the original GNU OpenMP; adding dynamic load balancing improves fine-grained task performance by up to a further 4× relative to GNU OpenMP with XQueue, with order-of-magnitude gains in specific scenarios. The core contribution is the combination of lock-less task queues, a distributed barrier, and NUMA-aware load balancing, which together remove the global lock and the centralized synchronization bottleneck.
📝 Abstract
Achieving efficient task parallelism on many-core architectures is an important challenge. The widely used GNU OpenMP implementation of the popular OpenMP parallel programming model incurs high overhead for fine-grained, short-running tasks due to time spent on runtime synchronization. In this work, we introduce and analyze three key advances that collectively achieve significant performance gains. First, we introduce XQueue, a lock-less concurrent queue implementation to replace GNU's priority task queue and remove the global task lock. Second, we develop a scalable, efficient, and hybrid lock-free/lock-less distributed tree barrier to address the high hardware synchronization overhead from GNU's centralized barrier. Third, we develop two lock-less and NUMA-aware load balancing strategies. We evaluate our implementation using the Barcelona OpenMP Task Suite (BOTS) benchmarks. Results from the first and second advances demonstrate up to 1522.8$\times$ performance improvement compared to the original GNU OpenMP. Further improvements from lock-less load balancing show up to 4$\times$ improvement compared to GNU OpenMP using XQueue. Through a rich set of profiling and instrumentation tools, we are able to investigate the runtime behavior of GNU OpenMP and improve its performance on fine-grained tasks by many orders of magnitude.
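To make the idea of a lock-less concurrent queue more concrete, the following is a minimal sketch of a single-producer/single-consumer ring buffer that passes items between two threads using only atomic loads and stores, with no mutex and no compare-and-swap. It is an illustration under stated assumptions, not the XQueue implementation: the abstract does not describe XQueue's internals, and the names `SpscQueue`, `try_push`, `try_pop`, and `run_demo` are hypothetical.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>

// Hypothetical single-producer/single-consumer (SPSC) ring buffer.
// Illustrative only; this is NOT the XQueue code from the paper.
// Assumes T is default-constructible and copy-assignable.
template <typename T, std::size_t Capacity>
class SpscQueue {
    static_assert(Capacity >= 2 && (Capacity & (Capacity - 1)) == 0,
                  "Capacity must be a power of two");

public:
    // Called only by the producer thread. Returns false if the queue is full.
    bool try_push(const T& value) {
        const std::size_t tail = tail_.load(std::memory_order_relaxed);
        const std::size_t head = head_.load(std::memory_order_acquire);
        if (tail - head == Capacity) return false;         // full
        slots_[tail & (Capacity - 1)] = value;
        tail_.store(tail + 1, std::memory_order_release);  // publish the slot
        return true;
    }

    // Called only by the consumer thread. Returns false if the queue is empty.
    bool try_pop(T& out) {
        const std::size_t head = head_.load(std::memory_order_relaxed);
        const std::size_t tail = tail_.load(std::memory_order_acquire);
        if (head == tail) return false;                    // empty
        out = slots_[head & (Capacity - 1)];
        head_.store(head + 1, std::memory_order_release);  // free the slot
        return true;
    }

private:
    T slots_[Capacity];
    // Separate cache lines so producer and consumer do not false-share.
    alignas(64) std::atomic<std::size_t> head_{0};
    alignas(64) std::atomic<std::size_t> tail_{0};
};

// Hypothetical demo: a producer thread sends 1..n, the caller consumes
// every item and returns the sum.
long run_demo(long n) {
    assert(n >= 0);
    SpscQueue<long, 64> queue;
    std::thread producer([&queue, n] {
        for (long i = 1; i <= n; ++i) {
            while (!queue.try_push(i)) std::this_thread::yield();
        }
    });

    long sum = 0;
    long received = 0;
    long value = 0;
    while (received < n) {
        if (queue.try_pop(value)) {
            sum += value;
            ++received;
        } else {
            std::this_thread::yield();
        }
    }
    producer.join();
    return sum;
}
```

Because each index is written by exactly one thread, neither side ever waits on a lock held by the other; a full or empty queue is reported to the caller, who decides whether to retry. A task runtime built on this idea would need one such queue per producer/consumer pair, which is an assumption of this sketch rather than a statement about the paper's design.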