AI Summary
Existing hardware model checking benchmarks are limited in number, structurally homogeneous, and often lack original RTL designs, hindering effective evaluation of verification tools and risking solver overfitting. This work proposes the first approach to integrate reinforcement learning into benchmark generation for hardware model checking. By generating computation graphs at an algorithmic abstraction level and compiling them via high-level synthesis (HLS) into functionally equivalent yet structurally diverse hardware designs, the method automatically constructs "small yet hard" instances. Solver runtime serves as the reward signal, enabling a closed-loop co-design between design space exploration and solver feedback. The resulting benchmarks, produced in standard AIGER/BTOR2 formats, exhibit significant diversity and effectively expose performance bottlenecks in state-of-the-art model checkers, thereby providing a high-quality, unbiased resource for future tool evaluation.
Abstract
Progress in hardware model checking depends critically on high-quality benchmarks. However, the community faces a significant benchmark gap: existing suites are limited in number, often distributed only in representations such as BTOR2 without access to the originating register-transfer-level (RTL) designs, and biased toward extreme difficulty where instances are either trivial or intractable. These limitations hinder rigorous evaluation of new verification techniques and encourage overfitting of solver heuristics to a narrow set of problems. To address this, we introduce EvolveGen, a framework for generating hardware model checking benchmarks by combining reinforcement learning (RL) with high-level synthesis (HLS). Our approach operates at an algorithmic level of abstraction in which an RL agent learns to construct computation graphs. By compiling these graphs under different synthesis directives, we produce pairs of functionally equivalent but structurally distinct hardware designs, inducing challenging model checking instances. Solver runtime is used as the reward signal, enabling the agent to autonomously discover and generate small-but-hard instances that expose solver-specific weaknesses. Experiments show that EvolveGen efficiently creates a diverse benchmark set in standard formats (e.g., AIGER and BTOR2) and effectively reveals performance bottlenecks in state-of-the-art model checkers.
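The closed loop the abstract describes can be illustrated with a minimal sketch. This is not EvolveGen's actual RL algorithm; it is a hypothetical greedy hill-climbing stand-in that shows the core idea of using solver runtime as the reward signal. The `solver_runtime` function is a toy proxy for invoking and timing a real model checker on a compiled AIGER/BTOR2 instance, and `mutate` stands in for the agent's graph-construction actions.

```python
import random

def solver_runtime(graph):
    # Hypothetical stand-in: in the real framework this would compile the
    # computation graph via HLS into two equivalent designs, emit an
    # AIGER/BTOR2 instance, run a model checker, and return wall-clock time.
    return sum(graph) * 0.001  # toy proxy for "hardness"

def mutate(graph):
    # Toy action: grow the computation graph by one node.
    return graph + [random.randint(1, 10)]

def search(episodes=50, seed=0):
    """Greedy loop: keep a mutation only if it increases the reward
    (solver runtime), discovering progressively harder instances."""
    random.seed(seed)
    graph, best_reward = [1], 0.0
    for _ in range(episodes):
        candidate = mutate(graph)
        reward = solver_runtime(candidate)
        if reward > best_reward:
            graph, best_reward = candidate, reward
    return graph, best_reward
```

In the paper's setting the agent is a learned policy rather than a random mutator, and the reward comes from timing state-of-the-art checkers on the generated instances; the loop structure, however, is the same.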