QiMeng-CodeV-SVA: Training Specialized LLMs for Hardware Assertion Generation via RTL-Grounded Bidirectional Data Synthesis

📅 2026-03-15

🤖 AI Summary
General-purpose large language models perform poorly on the natural language to SystemVerilog assertion (NL2SVA) task, primarily because high-quality real-world SVA corpora are scarce and there is no reliable method for assessing semantic equivalence between a natural language specification and SVA code. To address this, the work proposes an RTL-grounded bidirectional data synthesis framework that uses open-source RTL designs to generate large-scale NL↔SVA translation pairs, then applies semantic consistency filtering to construct high-quality training data. With this data, the authors train CodeV-SVA, the first family of specialized large language models for NL2SVA. The CodeV-SVA-14B variant achieves 75.8% and 84.0% Func.@1 accuracy on the NL2SVA-Human and NL2SVA-Machine benchmarks, respectively, matching or surpassing state-of-the-art models such as GPT-5 and DeepSeek-R1.

📝 Abstract
SystemVerilog Assertions (SVAs) are crucial for hardware verification. Recent studies leverage general-purpose LLMs to translate natural language properties to SVAs (NL2SVA), but they perform poorly due to limited data. We propose a data synthesis framework to tackle two challenges: the scarcity of high-quality real-world SVA corpora and the lack of reliable methods to determine NL-SVA semantic equivalence. For the former, large-scale open-source RTLs are used to guide LLMs to generate real-world SVAs; for the latter, bidirectional translation serves as a data selection method. With the synthesized data, we train CodeV-SVA, a series of SVA generation models. Notably, CodeV-SVA-14B achieves 75.8% on NL2SVA-Human and 84.0% on NL2SVA-Machine in Func.@1, matching or exceeding advanced LLMs like GPT-5 and DeepSeek-R1.
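
The idea of using bidirectional translation as a data selection method can be illustrated with a minimal round-trip filter. This is only a sketch under stated assumptions, not the paper's pipeline: the abstract does not specify how candidates are compared, so the function `round_trip_select`, the translator callables, and the whitespace-based `toy_equivalent` check are all hypothetical stand-ins (in practice the translators would be LLM calls and the equivalence check would likely be much stronger than string comparison).

```python
from typing import Callable, Iterable, List, Tuple


def normalize(sva: str) -> str:
    """Collapse whitespace so trivially different spellings compare equal."""
    return " ".join(sva.split())


def round_trip_select(
    svas: Iterable[str],
    sva_to_nl: Callable[[str], str],
    nl_to_sva: Callable[[str], str],
    equivalent: Callable[[str, str], bool],
) -> List[Tuple[str, str]]:
    """Keep (NL, SVA) pairs whose NL description translates back to an
    SVA judged equivalent to the original one."""
    kept = []
    for sva in svas:
        nl = sva_to_nl(sva)      # SVA -> NL direction
        back = nl_to_sva(nl)     # NL -> SVA direction
        if equivalent(sva, back):
            kept.append((nl, sva))
    return kept


# Toy data standing in for model outputs (hypothetical, for illustration only).
SVA_OK = "assert property (@(posedge clk) req |-> ##1 ack);"
SVA_BAD = "assert property (@(posedge clk) valid |-> !error);"
NL_OK = "Whenever req is high, ack must be high one cycle later."
NL_VAGUE = "valid and error are related."

_TO_NL = {SVA_OK: NL_OK, SVA_BAD: NL_VAGUE}
_TO_SVA = {
    # Extra whitespace shows that normalization absorbs formatting noise.
    NL_OK: "assert property (@(posedge clk)  req |-> ##1 ack);",
    # The vague description loses the negation, so the round trip fails.
    NL_VAGUE: "assert property (@(posedge clk) valid |-> error);",
}


def toy_sva_to_nl(sva: str) -> str:
    return _TO_NL[sva]


def toy_nl_to_sva(nl: str) -> str:
    return _TO_SVA[nl]


def toy_equivalent(a: str, b: str) -> bool:
    # Placeholder check: textual equality after whitespace normalization.
    return normalize(a) == normalize(b)


selected = round_trip_select(
    [SVA_OK, SVA_BAD], toy_sva_to_nl, toy_nl_to_sva, toy_equivalent
)
```

In this toy run, only the `req`/`ack` pair survives, because its description carries enough information to reproduce the assertion; the vague description of the second assertion is filtered out.
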
Problem

Research questions and friction points this paper is trying to address.

SystemVerilog Assertions
NL2SVA
data scarcity
semantic equivalence
hardware verification
Innovation

Methods, ideas, or system contributions that make the work stand out.

RTL-Grounded Synthesis
Bidirectional Data Selection
NL2SVA
Specialized LLMs
Hardware Assertion Generation