Reward-Weighted On-Policy Distillation with an Open Property-Equivalence Verifier for NL-to-SVA Generation

📅 2026-05-13

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing NL2SVA approaches rely on supervised fine-tuning, which mimics surface-level tokens rather than optimizing for semantic equivalence, so models that look strong in aggregate generalize poorly to bounded-delay and liveness specifications. This work proposes Reward-Weighted On-Policy Distillation (RWOPD), which places an open SymbiYosys+Z3 property-equivalence checker inside the training loop. The student samples its own rollouts, the checker scores them, and on verifier-passable rollouts RWOPD applies a reward-weighted forward-KL gradient from a frozen teacher (CodeV-SVA-14B) to a smaller student (Qwen2.5-Coder-7B-Instruct). This keeps supervision dense at every response token while grounding both rollout selection and loss weight in property equivalence. The resulting student sets a new state of the art on NL2SVA-Human and NL2SVA-Machine at pass@1, pass@5, and pass@10, surpassing specialized prior models and general-purpose 671B baselines.
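
To make the contrast between token mimicry and semantic equivalence concrete, here is a minimal Python sketch. It is not the SymbiYosys+Z3 checker used in the paper: it is a toy stand-in that enumerates every short trace over two signals `a` and `b` and compares hand-written models of three assertions. The function names (`overlapped_delay`, `non_overlapped`, `same_cycle`, `bounded_equivalent`, `token_jaccard`), the four-cycle bound, and the treatment of an obligation that runs past the end of the trace as satisfied are all assumptions made for illustration.

```python
from itertools import product

# A trace is a list of (a, b) booleans, one pair per clock cycle.
# Obligations that would fall past the end of the trace count as satisfied
# (an assumption of this toy model, not a statement about the paper's checker).

def overlapped_delay(trace):
    """Toy model of `a |-> ##1 b`: a at cycle t obliges b at cycle t+1."""
    return all(trace[t + 1][1] for t in range(len(trace) - 1) if trace[t][0])

def non_overlapped(trace):
    """Toy model of `a |=> b`, written independently as 'no violating pair'."""
    return not any(prev[0] and not cur[1] for prev, cur in zip(trace, trace[1:]))

def same_cycle(trace):
    """Toy model of `a |-> b`: a at cycle t obliges b at the same cycle."""
    return all(b for a, b in trace if a)

def bounded_equivalent(p, q, depth=4):
    """True if p and q agree on every trace of `depth` cycles (exhaustive)."""
    cycles = list(product([False, True], repeat=2))
    return all(p(list(tr)) == q(list(tr)) for tr in product(cycles, repeat=depth))

def token_jaccard(x, y):
    """Surface similarity: Jaccard overlap of whitespace-separated tokens."""
    sx, sy = set(x.split()), set(y.split())
    return len(sx & sy) / len(sx | sy)

reference = "a |-> ##1 b"
candidates = {"a |=> b": non_overlapped, "a |-> b": same_cycle}
for text, prop in candidates.items():
    print(text, token_jaccard(reference, text),
          bounded_equivalent(overlapped_delay, prop))
```

In this toy setting, `a |-> b` shares more tokens with the reference (Jaccard 0.75) than `a |=> b` does (0.4), yet only `a |=> b` agrees with the reference on every enumerated trace. A token-level objective would rank the two candidates the wrong way round, while an equivalence check would not.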

📝 Abstract

LLM-based generation of SystemVerilog Assertions (SVA) is often reported as nearing saturation, with the strongest specialized model reaching ~76% accuracy on NL2SVA-Human. We show that this aggregate hides a temporal gap: models that appear strong overall still collapse to a few implication templates on bounded-delay and liveness specifications. The core issue is that the dominant recipe, supervised fine-tuning on NL/SVA pairs, optimizes token-level mimicry rather than the property equivalence that defines SVA correctness. We introduce Reward-Weighted On-Policy Distillation (RWOPD), an on-policy distillation method that samples student rollouts, scores them with an open SymbiYosys+Z3 Property-Equivalence Checker (PEC), and applies a verifier-reward-weighted forward-KL gradient from a frozen 14B teacher on verifier-passable rollouts. This keeps the supervision dense at every response token while grounding both selection and loss weight in property-equivalent behavior. RWOPD distills CodeV-SVA-14B into a Qwen2.5-Coder-7B-Instruct student that sets a new state of the art on NL2SVA-Human and NL2SVA-Machine across pass@1, pass@5, and pass@10, surpassing both specialized prior SOTA models and 671B general-purpose baselines.
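
The objective described in the abstract can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the authors' implementation. The abstract does not give the reward scale, the pass criterion, or the normalization, so the sketch assumes a reward in [0, 1], a hypothetical `min_reward` threshold for "verifier-passable", and a reward-weighted mean over rollouts. The names `log_softmax`, `forward_kl`, and `reward_weighted_forward_kl` are also hypothetical. In real training only the student logits would carry gradients, since the teacher is frozen.

```python
import math

def log_softmax(logits):
    """Numerically stable log-probabilities from raw logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def forward_kl(teacher_logits, student_logits):
    """KL(teacher || student) at a single response-token position."""
    log_p = log_softmax(teacher_logits)
    log_q = log_softmax(student_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))

def reward_weighted_forward_kl(rollouts, min_reward=0.5):
    """Reward-weighted mean of per-token forward KL over accepted rollouts.

    Each rollout is a dict with a verifier 'reward' and per-token logits
    under 'teacher' and 'student' (assumed layout, for illustration only).
    """
    total, weight = 0.0, 0.0
    for r in rollouts:
        # Selection: skip rollouts the verifier does not accept.
        if r["reward"] < min_reward or not r["teacher"]:
            continue
        per_token = [forward_kl(t, s) for t, s in zip(r["teacher"], r["student"])]
        # Dense supervision: every response token contributes a KL term.
        total += r["reward"] * sum(per_token) / len(per_token)
        weight += r["reward"]
    return total / weight if weight > 0 else 0.0

rollouts = [
    # Accepted rollout: teacher is uniform, student puts 3/4 on token 0.
    {"reward": 1.0, "teacher": [[0.0, 0.0]], "student": [[math.log(3.0), 0.0]]},
    # Rejected rollout: contributes nothing, however large its KL is.
    {"reward": 0.0, "teacher": [[5.0, 0.0]], "student": [[0.0, 5.0]]},
]
loss = reward_weighted_forward_kl(rollouts)
print(loss)
```

For the accepted rollout the per-token KL is 0.5 * ln(4/3), about 0.144, and the rejected rollout is dropped before it can affect the loss. That is how the verifier reward grounds both selection and loss weight in this sketch.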

Problem

Research questions and friction points this paper is trying to address.

NL-to-SVA generation

property equivalence

SystemVerilog Assertions

supervised fine-tuning

temporal specifications

Innovation

Methods, ideas, or system contributions that make the work stand out.

Reward-Weighted On-Policy Distillation

Property-Equivalence Verification

NL-to-SVA Generation