Optimizing Scalable Multi-Cluster Architectures for Next-Generation Wireless Sensing and Communication

📅 2025-07-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
Problem: Emerging wireless communication and sensing applications demand massive data processing, yet there is little systematic guidance on cluster sizing for multi-cluster manycore architectures, a choice that affects synchronization overhead, data-movement efficiency, and programmability. Method: We extend the open-source shared-memory cluster MemPool into a multi-cluster architecture and propose a double-buffering barrier that decouples the processor cores from DMA transfers. Contribution/Results: Evaluating several cluster configurations on typical wireless sensing and communication workloads, we show that a single 256-core cluster can be twice as fast as sixteen 16-core clusters on memory-bound kernels and up to 24% faster on compute-bound kernels, owing to reduced synchronization and communication overheads. These results offer practical guidance on cluster sizing for highly parallel wireless baseband processors.

📝 Abstract
Next-generation wireless technologies (for immersive-massive communication, joint communication and sensing) demand highly parallel architectures for massive data processing. A common architectural template scales up by grouping tens to hundreds of cores into shared-memory clusters, which are then scaled out as multi-cluster manycore systems. This hierarchical design, used in GPUs and accelerators, requires a balancing act between fewer, larger clusters and more, smaller clusters, which affects design complexity, synchronization, communication efficiency, and programmability. Although every multi-cluster architecture must balance these trade-offs, there is limited insight into optimal cluster sizes. This paper analyzes various cluster configurations, focusing on synchronization, data-movement overhead, and programmability for typical wireless sensing and communication workloads. We extend the open-source shared-memory cluster MemPool into a multi-cluster architecture and propose a novel double-buffering barrier that decouples processor and DMA. Our results show that a single 256-core cluster can be twice as fast as sixteen 16-core clusters for memory-bound kernels and up to 24% faster for compute-bound kernels, owing to reduced synchronization and communication overheads.
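As a rough illustration of the double-buffering idea mentioned in the abstract, the sketch below overlaps data transfers with computation using two buffers. It is a generic software analogy, not the paper's barrier mechanism, whose details this summary does not give. The names `process_double_buffered` and `dma_load` are hypothetical, and a helper thread stands in for a DMA engine.

```python
import threading


def process_double_buffered(tiles, compute):
    """Overlap 'DMA' loads with compute using two buffers.

    Hypothetical sketch: a helper thread plays the role of a DMA engine
    and fills one buffer while the calling thread computes on the other.
    """
    buffers = [None, None]
    results = []
    if not tiles:
        return results

    def dma_load(slot, tile):
        # Copying the tile stands in for a DMA transfer into local memory.
        buffers[slot] = list(tile)

    dma_load(0, tiles[0])  # prologue: fetch the first tile before computing

    for i in range(len(tiles)):
        cur, nxt = i % 2, (i + 1) % 2
        transfer = None
        if i + 1 < len(tiles):
            # Start fetching the next tile into the idle buffer.
            transfer = threading.Thread(target=dma_load, args=(nxt, tiles[i + 1]))
            transfer.start()
        # Compute on the current buffer while the transfer is in flight.
        results.append(compute(buffers[cur]))
        if transfer is not None:
            # Synchronization point: the next buffer must be complete
            # before the roles of the two buffers are swapped.
            transfer.join()
    return results


if __name__ == "__main__":
    print(process_double_buffered([[1, 2], [3, 4], [5]], sum))
```

In this sketch the compute thread waits only at the swap point, so a transfer that finishes within one compute step is hidden behind the computation.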
Problem

Research questions and friction points this paper is trying to address.

Balancing cluster sizes in multi-cluster architectures for optimal performance
Reducing synchronization and communication overheads in wireless sensing workloads
Evaluating trade-offs between large and small clusters for efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Extends MemPool into multi-cluster architecture
Proposes novel double-buffering barrier technique
Compares cluster configurations and shows that larger clusters reduce synchronization and communication overheads