AI Summary
To address the severe contention and scalability bottlenecks caused by fetch-and-add operations on a single memory location under high concurrency, this paper proposes Aggregating Funnels, a novel mechanism that distributes atomic operations across multiple memory locations to enable cross-location batch aggregation and decoupled result computation. The approach leverages dual-location coordinated batching, lock-free concurrency control, and fine-grained memory layout optimization, building an efficient aggregation path directly atop hardware-supported fetch-and-add instructions. Unlike applying fetch-and-add to a single location or using Combining Funnels, Aggregating Funnels overcomes the scalability limits of those prior designs. Experimental evaluation shows higher throughput than the state-of-the-art Combining Funnels. When integrated into the fastest state-of-the-art concurrent queue, it removes a serialization bottleneck and greatly improves the queue's overall throughput.
Abstract
Many concurrent algorithms require processes to perform fetch-and-add operations on a single memory location, which can be a hot spot of contention. We present a novel algorithm called Aggregating Funnels that reduces this contention by spreading the fetch-and-add operations across multiple memory locations. It aggregates fetch-and-add operations into batches so that the batch can be performed by a single hardware fetch-and-add instruction on one location and all operations in the batch can efficiently compute their results by performing a fetch-and-add instruction on a different location. We show experimentally that this approach achieves higher throughput than previous combining techniques, such as Combining Funnels, and is substantially more scalable than applying hardware fetch-and-add instructions on a single memory location. We show that replacing the fetch-and-add instructions in the fastest state-of-the-art concurrent queue by our Aggregating Funnels eliminates a bottleneck and greatly improves the queue's overall throughput.
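The batching idea described above can be illustrated with a minimal C++ sketch. It is not the paper's algorithm: the names `Batch`, `Aggregator`, `FunnelCounter`, and `funnel_demo` are hypothetical, the mapping of threads to aggregators by `tid` is an assumption, waiting operations simply spin, every `delta` is assumed to be at least 1, and published batches are only freed when the counter is destroyed. Each operation first performs a fetch-and-add on its aggregator to obtain a local offset; the one operation whose offset starts a new batch applies the whole batch to the main counter with a single fetch-and-add and publishes the result for the others.

```cpp
#include <algorithm>
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// One published batch: aggregator-local offsets [start, end) map to
// main-counter values [base, base + (end - start)).
struct Batch {
    std::uint64_t start, end, base;
    Batch* prev;  // older batches, newest first
};

// Hypothetical aggregator: a local counter plus the list of published batches.
struct Aggregator {
    alignas(64) std::atomic<std::uint64_t> pending{0};  // sum of deltas announced here
    alignas(64) std::atomic<Batch*> latest;             // most recently published batch
    Aggregator() : latest(new Batch{0, 0, 0, nullptr}) {}  // empty sentinel batch
    ~Aggregator() {
        for (Batch* b = latest.load(); b != nullptr;) {
            Batch* older = b->prev;
            delete b;
            b = older;
        }
    }
};

class FunnelCounter {
public:
    explicit FunnelCounter(std::size_t aggregators) : aggs_(aggregators) {}

    // Returns the counter value before this operation took effect.
    // Assumption of this sketch: delta >= 1, so local offsets are unique.
    std::uint64_t fetch_add(std::size_t tid, std::uint64_t delta) {
        Aggregator& g = aggs_[tid % aggs_.size()];
        const std::uint64_t a = g.pending.fetch_add(delta);  // local offset
        for (;;) {
            Batch* b = g.latest.load(std::memory_order_acquire);
            if (b->end == a) {
                // This operation starts the next batch: apply everything
                // announced so far with ONE fetch-and-add on the main counter.
                const std::uint64_t e = g.pending.load();
                const std::uint64_t base = main_.fetch_add(e - a);
                g.latest.store(new Batch{a, e, base, b}, std::memory_order_release);
                return base;
            }
            if (b->end > a) {
                // Already covered by a published batch: find it and derive
                // the result from the batch base and the local offset.
                while (b->start > a) b = b->prev;
                return b->base + (a - b->start);
            }
            std::this_thread::yield();  // the covering batch is not published yet
        }
    }

    std::uint64_t load() const { return main_.load(); }

private:
    std::vector<Aggregator> aggs_;
    alignas(64) std::atomic<std::uint64_t> main_{0};
};

// Hypothetical self-check: with delta == 1, the returned values must be
// exactly 0, 1, ..., total - 1, each handed out once.
bool funnel_demo(std::size_t threads, std::size_t ops_per_thread, std::size_t aggregators) {
    FunnelCounter counter(aggregators);
    std::vector<std::vector<std::uint64_t>> seen(threads);
    std::vector<std::thread> pool;
    for (std::size_t t = 0; t < threads; ++t) {
        pool.emplace_back([&, t] {
            for (std::size_t i = 0; i < ops_per_thread; ++i)
                seen[t].push_back(counter.fetch_add(t, 1));
        });
    }
    for (auto& th : pool) th.join();
    std::vector<std::uint64_t> all;
    for (const auto& v : seen) all.insert(all.end(), v.begin(), v.end());
    std::sort(all.begin(), all.end());
    for (std::size_t i = 0; i < all.size(); ++i)
        if (all[i] != i) return false;
    return counter.load() == threads * ops_per_thread;
}
```

In this sketch the main counter sees one fetch-and-add per batch rather than one per operation, which is where the reduction in contention would come from. A call such as `funnel_demo(8, 10000, 2)` runs eight threads over two aggregators and checks that no value is lost or duplicated. A production design would also need to bound the batch list and choose the number of aggregators to match the machine.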