🤖 AI Summary
Traditional single-threaded RTL simulation suffers severe performance bottlenecks as chip complexity grows, limiting verification scalability.
Method: This paper proposes a cycle-accurate RTL simulation paradigm tailored for thousand-core parallelism, built upon the Graphcore IPU architecture. It introduces a novel fine-grained RTL graph partitioning scheme and a dedicated compiler, along with a lightweight synchronization protocol and communication optimization mechanisms.
Contribution/Results: The approach scales cycle-accurate RTL simulation to 5,888 cores across four IPUs. A systematic quantitative analysis isolates synchronization, communication, and computation overheads. Evaluated on a 4-IPU system, it runs large RTL designs up to 4× faster than state-of-the-art x64 multicore systems. This work establishes a scalable, hardware-accelerated parallelization pathway for large-scale hardware verification.
📝 Abstract
Hardware development critically depends on cycle-accurate RTL simulation. However, as chip complexity increases, conventional single-threaded simulation becomes impractical due to stagnant single-core performance. Parendi is an RTL simulator that addresses this challenge by exploiting the abundant fine-grained parallelism inherent in RTL simulation and efficiently mapping it onto the massively parallel Graphcore IPU (Intelligence Processing Unit) architecture. Parendi scales up to 5888 cores on 4 Graphcore IPU sockets. It allows us to run large RTL designs up to 4$\times$ faster than the most powerful state-of-the-art x64 multicore systems. To achieve this performance, we developed new partitioning and compilation techniques and carefully quantified the synchronization, communication, and computation costs of parallel RTL simulation. The paper comprehensively analyzes these factors and details the strategies that Parendi uses to optimize them.
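To make the three costs named in the abstract concrete, here is a minimal sketch of barrier-synchronized, cycle-accurate simulation over a partitioned design. It is not Parendi's implementation: the `simulate` function, the `Partition` type, and the two-register example design are hypothetical, and the partitions run sequentially here purely for illustration. In a real parallel simulator each partition would run on its own core, the commit step would require inter-core communication, and the boundary between the two phases would be a barrier.

```python
from typing import Callable, Dict, List

State = Dict[str, int]
# A partition reads the current register state and returns the next values
# of the registers it owns (hypothetical interface, for illustration only).
Partition = Callable[[State], State]

def simulate(partitions: List[Partition], state: State, cycles: int) -> State:
    """Two-phase simulation: compute against a snapshot, then commit."""
    for _ in range(cycles):
        # Compute phase: every partition sees the same snapshot, so the
        # partitions are independent of each other within a cycle.
        updates = [p(state) for p in partitions]
        # Exchange phase: commit all register writes together, which models
        # the clock edge. In a parallel setting a barrier precedes this step.
        next_state = dict(state)
        for u in updates:
            next_state.update(u)
        state = next_state
    return state

# Hypothetical design: an 8-bit counter and a register that latches the
# counter's value from the previous cycle.
def counter(s: State) -> State:
    return {"count": (s["count"] + 1) & 0xFF}

def follower(s: State) -> State:
    return {"prev": s["count"]}

final = simulate([counter, follower], {"count": 0, "prev": 0}, cycles=3)
print(final)
```

Because `follower` reads the snapshot rather than the counter's freshly computed value, `prev` always lags `count` by one cycle, which is the register semantics a cycle-accurate simulator has to preserve regardless of how the design is partitioned.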