🤖 AI Summary
To address the weak semantic alignment between natural language (NL) and Verilog, and the inadequacy of existing evaluation metrics for hardware description language (HDL) generation, this paper proposes DeepRTL, a unified representation model. Methodologically, DeepRTL extends CodeT5+ and is trained on a novel multi-level NL-Verilog alignment dataset; curriculum learning is used to jointly fine-tune the model on two complementary tasks, Verilog understanding and generation. The paper introduces the first dedicated Verilog understanding benchmark and proposes a new semantic-consistency evaluation paradigm that integrates embedding similarity and GPT Score. Experiments demonstrate that DeepRTL significantly outperforms GPT-4 on Verilog understanding and matches OpenAI's o1-preview in synthesizable Verilog generation quality. Crucially, this evaluation framework more accurately reflects semantic correctness and end-to-end synthesizability, addressing critical limitations of prior metrics in HDL generation assessment.
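The curriculum-learning idea above, training on finer-grained NL-Verilog pairs before coarser ones, can be sketched roughly as follows. This is a minimal illustration, not the paper's actual training pipeline: the `level` tags, sample fields, and ordering are all assumptions made for the sketch.

```python
# Hypothetical sketch: order paired (NL, Verilog) samples by an assumed
# alignment granularity, easiest (line-level) first, then batch them.
# The real DeepRTL schedule, data format, and model are not reproduced here.

# Assumed curriculum stages, from fine-grained to whole-module descriptions.
CURRICULUM_ORDER = ["line", "block", "module"]

def curriculum_batches(samples, batch_size=2):
    """Yield batches with samples sorted by curriculum stage."""
    rank = {stage: i for i, stage in enumerate(CURRICULUM_ORDER)}
    ordered = sorted(samples, key=lambda s: rank[s["level"]])
    for i in range(0, len(ordered), batch_size):
        yield ordered[i:i + batch_size]

# Toy samples with made-up content, tagged by assumed granularity.
samples = [
    {"level": "module", "nl": "4-bit adder", "verilog": "module add4 ..."},
    {"level": "line",   "nl": "wire declaration", "verilog": "wire [3:0] s;"},
    {"level": "block",  "nl": "clocked always block", "verilog": "always @(posedge clk) ..."},
]

for batch in curriculum_batches(samples):
    print([s["level"] for s in batch])  # line/block pairs surface before module-level ones
```

A fine-tuning loop would then consume these batches in order, so the model sees easier, tightly aligned pairs before full-module descriptions.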
📝 Abstract
Recent advancements in large language models (LLMs) have shown significant potential for automating hardware description language (HDL) code generation from high-level natural language instructions. While fine-tuning has improved LLMs' performance in hardware design tasks, prior efforts have largely focused on Verilog generation, overlooking the equally critical task of Verilog understanding. Furthermore, existing models suffer from weak alignment between natural language descriptions and Verilog code, hindering the generation of high-quality, synthesizable designs. To address these issues, we present DeepRTL, a unified representation model that excels in both Verilog understanding and generation. Based on CodeT5+, DeepRTL is fine-tuned on a comprehensive dataset that aligns Verilog code with rich, multi-level natural language descriptions. We also introduce the first benchmark for Verilog understanding and are the first to apply embedding similarity and GPT Score to evaluate models' understanding capabilities. These metrics capture semantic similarity more accurately than traditional methods like BLEU and ROUGE, which are limited to surface-level n-gram overlaps. By adapting curriculum learning to train DeepRTL, we enable it to significantly outperform GPT-4 in Verilog understanding tasks, while achieving performance on par with OpenAI's o1-preview model in Verilog generation tasks.
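The abstract's point about n-gram metrics can be made concrete with a toy comparison. The sketch below is purely illustrative: real evaluations use pretrained sentence-embedding models, whereas here a hypothetical synonym-aware bag-of-words embedding stands in, just to show why a paraphrase with no shared tokens defeats BLEU-style overlap but not embedding similarity.

```python
# Toy contrast: BLEU-1-style unigram precision vs. cosine similarity of a
# synonym-collapsed bag-of-words "embedding". Both the synonym map and the
# example sentences are invented for illustration.
from collections import Counter
import math

SYNONYMS = {  # hypothetical map collapsing paraphrases onto shared tokens
    "adds": "sum", "sums": "sum",
    "module": "circuit",
    "inputs": "operand", "operands": "operand",
}

def normalize(tokens):
    return [SYNONYMS.get(t, t) for t in tokens]

def unigram_overlap(ref, hyp):
    """BLEU-1-style precision: fraction of hypothesis tokens present in the reference."""
    ref_counts = Counter(ref)
    matched = sum(min(c, ref_counts[t]) for t, c in Counter(hyp).items())
    return matched / max(len(hyp), 1)

def cosine_sim(ref, hyp):
    """Cosine similarity of bag-of-words vectors over synonym-normalized tokens."""
    a, b = Counter(normalize(ref)), Counter(normalize(hyp))
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

reference = "this module adds two inputs".split()
paraphrase = "the circuit sums both operands".split()

print(unigram_overlap(reference, paraphrase))       # 0.0: no surface n-gram overlap
print(round(cosine_sim(reference, paraphrase), 2))  # 0.6: synonyms collapse to shared tokens
```

The paraphrase scores zero under surface overlap despite describing the same circuit, while the embedding-style score recovers the shared semantics, which is the failure mode of BLEU/ROUGE the paper's evaluation paradigm is designed to avoid.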