AI Summary
Existing hardware model checking benchmarks are limited in number, structurally homogeneous, and often lack original RTL designs, hindering effective evaluation of verification tools and risking solver overfitting. This work proposes the first approach to integrate reinforcement learning into benchmark generation for hardware model checking. By generating computation graphs at an algorithmic abstraction level and compiling them via high-level synthesis (HLS) into functionally equivalent yet structurally diverse hardware designs, the method automatically constructs "small yet hard" instances. Solver runtime serves as the reward signal, enabling a closed-loop co-design between design space exploration and solver feedback. The resulting benchmarks, produced in standard AIGER/BTOR2 formats, exhibit significant diversity and effectively expose performance bottlenecks in state-of-the-art model checkers, thereby providing a high-quality, unbiased resource for future tool evaluation.
Abstract
Progress in hardware model checking depends critically on high-quality benchmarks. However, the community faces a significant benchmark gap: existing suites are limited in number, often distributed only in representations such as BTOR2 without access to the originating register-transfer-level (RTL) designs, and biased toward extreme difficulty where instances are either trivial or intractable. These limitations hinder rigorous evaluation of new verification techniques and encourage overfitting of solver heuristics to a narrow set of problems. To address this, we introduce EvolveGen, a framework for generating hardware model checking benchmarks by combining reinforcement learning (RL) with high-level synthesis (HLS). Our approach operates at an algorithmic level of abstraction in which an RL agent learns to construct computation graphs. By compiling these graphs under different synthesis directives, we produce pairs of functionally equivalent but structurally distinct hardware designs, inducing challenging model checking instances. Solver runtime is used as the reward signal, enabling the agent to autonomously discover and generate small-but-hard instances that expose solver-specific weaknesses. Experiments show that EvolveGen efficiently creates a diverse benchmark set in standard formats (e.g., AIGER and BTOR2) and effectively reveals performance bottlenecks in state-of-the-art model checkers.
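The closed loop the abstract describes can be illustrated with a minimal sketch. This is not EvolveGen's actual RL algorithm; it is a hypothetical greedy hill-climbing stand-in that shows the core idea of using solver runtime as the reward signal. The `solver_runtime` function is a toy proxy for invoking and timing a real model checker on a compiled AIGER/BTOR2 instance, and `mutate` stands in for the agent's graph-construction actions.

```python
import random

def solver_runtime(graph):
    # Hypothetical stand-in: in the real framework this would compile the
    # computation graph via HLS into two equivalent designs, emit an
    # AIGER/BTOR2 instance, run a model checker, and return wall-clock time.
    return sum(graph) * 0.001  # toy proxy for "hardness"

def mutate(graph):
    # Toy action: grow the computation graph by one node.
    return graph + [random.randint(1, 10)]

def search(episodes=50, seed=0):
    """Greedy loop: keep a mutation only if it increases the reward
    (solver runtime), discovering progressively harder instances."""
    random.seed(seed)
    graph, best_reward = [1], 0.0
    for _ in range(episodes):
        candidate = mutate(graph)
        reward = solver_runtime(candidate)
        if reward > best_reward:
            graph, best_reward = candidate, reward
    return graph, best_reward
```

In the paper's setting the agent is a learned policy rather than a random mutator, and the reward comes from timing state-of-the-art checkers on the generated instances; the loop structure, however, is the same.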