🤖 AI Summary
Emerging wireless communication and sensing applications demand massive data processing, yet existing multi-cluster many-core architectures lack systematic guidance for cluster sizing, forcing ad-hoc trade-offs among synchronization overhead, data-movement efficiency, and programmability. Method: We propose a double-buffering barrier mechanism that decouples the compute cores from DMA operations, and extend the shared-memory system MemPool to multi-cluster topologies to enable efficient cross-cluster synchronization and data collaboration. Contribution/Results: Through empirical evaluation across cluster sizes on representative wireless workloads, we show that large clusters (e.g., 256 cores per cluster) significantly reduce synchronization and inter-cluster communication overhead: memory-intensive kernels achieve a 2× speedup, while compute-intensive kernels improve by up to 24%. This work establishes both theoretical foundations and practical design principles for highly parallel wireless baseband processors.
📝 Abstract
Next-generation wireless technologies (for immersive and massive communication, and joint communication and sensing) demand highly parallel architectures for massive data processing. A common architectural template scales up by grouping tens to hundreds of cores into shared-memory clusters, which are then scaled out into multi-cluster manycore systems. This hierarchical design, also used in GPUs and accelerators, requires a balancing act between a few large clusters and many small clusters, affecting design complexity, synchronization, communication efficiency, and programmability. While all multi-cluster architectures must balance these trade-offs, there is limited insight into optimal cluster sizes. This paper analyzes various cluster configurations, focusing on synchronization, data-movement overhead, and programmability for typical wireless sensing and communication workloads. We extend the open-source shared-memory cluster MemPool into a multi-cluster architecture and propose a novel double-buffering barrier that decouples the processors from the DMA engine. Our results show that a single 256-core cluster can be twice as fast as sixteen 16-core clusters for memory-bound kernels and up to 24% faster for compute-bound kernels, thanks to reduced synchronization and communication overheads.
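The core idea behind the double-buffering barrier is that the DMA engine fills one buffer while the cores compute on the other, so neither side waits at a global barrier for the full transfer-plus-compute cycle. A minimal software analogue can be sketched in Python (hypothetical: the paper's mechanism is a hardware/runtime barrier inside MemPool; here `threading.Event` objects stand in for the per-buffer synchronization, and the DMA load is simulated by writing tile data):

```python
import threading

NUM_TILES = 4   # number of data tiles to stream through
TILE = 8        # elements per tile (illustrative sizes)

class DoubleBuffer:
    """Two buffers with per-buffer handshakes between DMA and compute."""
    def __init__(self):
        self.buf = [[0] * TILE, [0] * TILE]
        # 'ready' is set by the DMA thread once a buffer holds fresh data;
        # 'free' is set by the compute thread once a buffer may be reused.
        self.ready = [threading.Event(), threading.Event()]
        self.free = [threading.Event(), threading.Event()]
        for f in self.free:
            f.set()  # both buffers start out free

def dma_thread(db: DoubleBuffer) -> None:
    """Simulated DMA: loads tile t into buffer t % 2 as soon as it is free."""
    for t in range(NUM_TILES):
        i = t % 2
        db.free[i].wait()
        db.free[i].clear()
        db.buf[i] = [t * TILE + k for k in range(TILE)]  # "DMA transfer"
        db.ready[i].set()

def compute_thread(db: DoubleBuffer, out: list) -> None:
    """Compute cores: consume each tile once its buffer is ready."""
    for t in range(NUM_TILES):
        i = t % 2
        db.ready[i].wait()
        db.ready[i].clear()
        out.append(sum(db.buf[i]))  # stand-in for the wireless kernel
        db.free[i].set()

db = DoubleBuffer()
out: list = []
t_dma = threading.Thread(target=dma_thread, args=(db,))
t_cmp = threading.Thread(target=compute_thread, args=(db, out))
t_dma.start(); t_cmp.start()
t_dma.join(); t_cmp.join()
print(out)  # one partial sum per tile
```

The key design point the sketch illustrates: synchronization is per-buffer, not global, so while the cores process buffer 0 the DMA is already filling buffer 1, overlapping communication with computation instead of serializing them behind a single barrier.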