🤖 AI Summary
This work addresses a central challenge in hardware verification: feedback from industrial simulators is costly and slow to obtain, making online reinforcement learning impractical. The authors propose an execution-aware offline agent-learning framework that formulates verification as a memoryless state-transition process guided by a deterministic evaluator. By combining execution-validated data curation, policy-aware synthetic data generation, and worst-state-prioritized sampling, the framework enables efficient and scalable learning under execution constraints. Remarkably, a compact 4B-parameter model trained with this approach achieves a 69.2% coverage pass rate under agentic evaluation, surpassing its teacher model by 5.3% and matching the performance of models an order of magnitude larger, thereby substantially alleviating the feedback-scarcity bottleneck inherent in hardware verification.
📝 Abstract
Execution-aware LLM agents offer a promising paradigm for learning from tool feedback, but such feedback is often expensive and slow to obtain, making online reinforcement learning (RL) impractical. High-coverage hardware verification exemplifies this challenge because of its reliance on industrial simulators and non-differentiable execution signals. We propose LLM4Cov, an offline agent-learning framework that models verification as memoryless state transitions guided by deterministic evaluators. Building on this formulation, we introduce execution-validated data curation, policy-aware agentic data synthesis, and worst-state-prioritized sampling to enable scalable learning under execution constraints. We further curate a reality-aligned benchmark, adapted from an existing verification suite through a revised evaluation protocol. Using the proposed pipeline, a compact 4B-parameter model achieves a 69.2% coverage pass rate under agentic evaluation, outperforming its teacher by 5.3% and performing competitively against models an order of magnitude larger.