🤖 AI Summary
Existing NL2SVA approaches rely on supervised fine-tuning on NL/SVA pairs, which optimizes token-level mimicry rather than property equivalence and leaves models collapsing to a few implication templates on bounded-delay and liveness specifications. This work proposes Reward-Weighted On-Policy Distillation (RWOPD), which brings an open SymbiYosys+Z3 Property-Equivalence Checker (PEC) into the training loop. RWOPD samples rollouts from a lightweight student model (Qwen2.5-Coder-7B-Instruct), scores them with the PEC, and applies a verifier-reward-weighted forward-KL gradient from a frozen teacher model (CodeV-SVA-14B) on verifier-passable rollouts. This preserves dense supervision at every response token while grounding both rollout selection and loss weight in the semantic correctness of the generated SVAs. The resulting 7B student sets a new state of the art on both the NL2SVA-Human and NL2SVA-Machine benchmarks across pass@1, pass@5, and pass@10, surpassing specialized prior models as well as general-purpose baselines with 671B parameters.
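For readers unfamiliar with the pass@k metric quoted above, the following is a minimal sketch of the unbiased estimator commonly used in code-generation evaluation. Whether the paper uses exactly this estimator is an assumption, and the function name `pass_at_k` and its parameters are illustrative. Given `n` sampled assertions for a specification, of which `c` pass the equivalence check, it estimates the probability that at least one of `k` randomly drawn samples is correct.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n samples of which c are correct.

    Computes 1 - C(n - c, k) / C(n, k): one minus the probability that
    all k samples drawn without replacement are incorrect.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k incorrect samples exist,
    # so the estimate is exactly 1.0 in that case.
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # Hypothetical counts for illustration only: 10 samples, 3 correct.
    for k in (1, 5, 10):
        print(f"pass@{k} = {pass_at_k(10, 3, k):.4f}")
```

A benchmark-level score would then be the mean of this per-specification estimate over all specifications in the benchmark.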
📝 Abstract
LLM-based generation of SystemVerilog Assertions (SVA) is often reported as nearing saturation, with the strongest specialized model reaching ${\sim}76\%$ accuracy on NL2SVA-Human. We show that this aggregate hides a temporal gap: models that appear strong overall still collapse to a few implication templates on bounded-delay and liveness specifications. The core issue is that the dominant recipe, supervised fine-tuning on NL/SVA pairs, optimizes token-level mimicry rather than the \emph{property equivalence} that defines SVA correctness. We introduce \emph{Reward-Weighted On-Policy Distillation} (RWOPD), an on-policy distillation method that samples student rollouts, scores them with an open SymbiYosys+Z3 Property-Equivalence Checker (PEC), and applies a verifier-reward-weighted forward-KL gradient from a frozen 14B teacher on verifier-passable rollouts. This keeps the supervision dense at every response token while grounding both selection and loss weight in property-equivalent behavior. RWOPD distills CodeV-SVA-14B into a Qwen2.5-Coder-7B-Instruct student that sets a new state of the art on NL2SVA-Human and NL2SVA-Machine across pass@1, pass@5, and pass@10, surpassing both specialized prior SOTA models and 671B general-purpose baselines.
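The objective described in the abstract can be illustrated with a small, dependency-free sketch. It is not the authors' implementation: the rollout dictionary layout, the function names, the per-rollout token averaging, and the normalization by total reward are all assumptions made for illustration. Each rollout carries a verifier reward (zero when the equivalence check fails) plus per-token teacher and student logits; the loss is the reward-weighted mean of the token-level forward KL, $\mathrm{KL}(p_{\text{teacher}} \,\|\, p_{\text{student}})$, over rollouts with positive reward.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one token's vocabulary logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def forward_kl(teacher_logits, student_logits):
    """Forward KL, KL(teacher || student), for a single token position."""
    log_p = log_softmax(teacher_logits)
    log_q = log_softmax(student_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


def rwopd_loss(rollouts):
    """Reward-weighted forward-KL loss over student rollouts (illustrative).

    Each rollout is a dict with (hypothetical) keys:
      "reward":         verifier reward, 0.0 if the equivalence check fails
      "teacher_logits": list of per-token logit lists from the frozen teacher
      "student_logits": list of per-token logit lists from the student
    """
    weighted_sum = 0.0
    reward_sum = 0.0
    for rollout in rollouts:
        reward = rollout["reward"]
        if reward <= 0.0:
            continue  # rollouts that fail the verifier contribute nothing
        token_kls = [
            forward_kl(t, s)
            for t, s in zip(rollout["teacher_logits"], rollout["student_logits"])
        ]
        # Dense supervision: every response token contributes a KL term.
        weighted_sum += reward * sum(token_kls) / len(token_kls)
        reward_sum += reward
    return weighted_sum / reward_sum if reward_sum > 0.0 else 0.0


if __name__ == "__main__":
    # Toy two-token rollouts over a two-word vocabulary (made-up numbers).
    passing = {
        "reward": 1.0,
        "teacher_logits": [[0.0, 0.0], [2.0, 0.0]],
        "student_logits": [[math.log(3.0), 0.0], [2.0, 0.0]],
    }
    failing = {
        "reward": 0.0,
        "teacher_logits": [[5.0, 0.0]],
        "student_logits": [[0.0, 5.0]],
    }
    print(rwopd_loss([passing, failing]))
```

In actual training the same quantity would be computed on full-vocabulary logits inside an autodiff framework, with gradients flowing only through the student; the plain-Python version only makes the weighting and filtering logic explicit.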