🤖 AI Summary
In particle-in-cell (PIC) simulations of low-temperature plasmas, charge deposition (CD) suffers from severe parallel bottlenecks due to frequent particle–mesh interactions—especially in 2D/3D device-scale simulations, where conventional per-core private mesh strategies incur substantial memory redundancy and poor scalability. To address this, we propose a particle–thread binding mechanism: only four private meshes per node are required, achieved via fine-grained thread binding, flag-based synchronization, and a lightweight arbitration function that prevents concurrent particle updates to the same mesh cell. The method preserves standard PIC data structures and requires minimal code modifications. Experimental evaluation on large-scale distributed-memory (thousand-core) and shared-memory systems demonstrates strong scalability of the CD kernel—significantly outperforming traditional approaches—while maintaining low hardware dependency and implementation overhead.
📝 Abstract
The Particle-In-Cell (PIC) method for plasma simulation tracks particle phase-space information using particle and grid data structures. The high computational cost of 2D and 3D device-scale PIC simulations necessitates parallelization, with the Charge Deposition (CD) subroutine often becoming a bottleneck due to frequent particle–grid interactions. Conventional methods mitigate data dependencies by generating a private grid for each core, but this approach faces scalability issues. We propose a novel approach based on a particle–thread binding strategy that requires only four private grids per node on distributed-memory systems, or four in total on shared-memory systems, enhancing CD scalability and performance while preserving conventional data structures and requiring minimal changes to existing PIC codes. The method ensures complete accessibility of the grid data structure for concurrent threads and prevents simultaneous access to particles within the same cell using additional functions and flags. Performance evaluations using a PIC benchmark for low-temperature partially magnetized E×B discharge simulation, on a shared-memory system as well as a distributed-memory system (1,000 cores), demonstrate the method's scalability and show that it has little hardware dependency.
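The core idea, as described in the abstract, is a small fixed number of private grids shared by bound threads, with flag-based arbitration so no two threads update the same cell concurrently. The following is a minimal illustrative sketch (not the authors' implementation) in Python: the grid size, particle positions, linear weighting, and the use of per-cell locks as the "flags" are all assumptions made for the example.

```python
import threading

NX = 8          # hypothetical 1D grid size for illustration
N_PRIVATE = 4   # four private grids per node, as in the proposed method

# Four private grids instead of one per thread; threads are bound to them.
private_grids = [[0.0] * NX for _ in range(N_PRIVATE)]
# One lock per cell per private grid stands in for the paper's flags,
# arbitrating concurrent particle updates to the same cell.
cell_locks = [[threading.Lock() for _ in range(NX)] for _ in range(N_PRIVATE)]

def deposit(thread_id, particles):
    """Deposit linear-weighted charge into the thread's bound private grid."""
    g = thread_id % N_PRIVATE          # particle-thread binding to a grid
    grid, locks = private_grids[g], cell_locks[g]
    for x, q in particles:
        i = int(x)                     # left node index
        w = x - i                      # linear weight toward the right node
        with locks[i]:                 # only one updater per cell at a time
            grid[i] += q * (1.0 - w)
        with locks[(i + 1) % NX]:
            grid[(i + 1) % NX] += q * w

# Eight threads share the four private grids, so two threads per grid
# exercise the per-cell arbitration; each deposits one unit charge at x=1.5.
threads = [threading.Thread(target=deposit, args=(t, [(1.5, 1.0)]))
           for t in range(8)]
for th in threads:
    th.start()
for th in threads:
    th.join()

# Reduction step: sum the four private grids into the global charge grid.
global_grid = [sum(g[i] for g in private_grids) for i in range(NX)]
```

Because only four private copies exist regardless of thread count, the memory redundancy and the cost of the final reduction stay constant as cores are added, which is the scalability argument the abstract makes against the one-private-grid-per-core approach.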