Revisiting VerilogEval: A Year of Improvements in Large-Language Models for Hardware Code Generation

📅 2024-08-20
🤖 AI Summary
This work addresses the limited performance of large language models (LLMs) at generating hardware description language code, particularly Verilog RTL, which stems from scarce hardware training data and a shortage of hardware benchmarks. The contributions are threefold: (1) an improved VerilogEval benchmark that adds specification-to-RTL translation tasks, automatic failure classification, and in-context learning support; (2) an evaluation of recent commercial and open models against the improved benchmark; and (3) empirical evidence that prompt engineering strongly affects pass rates and that its effect varies widely with model and task. Experiments show GPT-4o achieves a 63% pass rate on specification-to-RTL translation, the open Llama3.1 405B attains 58%, and the smaller domain-specific RTL-Coder 6.7B reaches 34%. The infrastructure enhancements improve debuggability and reproducibility. All code and benchmarks are publicly released.

📝 Abstract
The application of large-language models (LLMs) to digital hardware code generation is an emerging field, with most LLMs primarily trained on natural language and software code. Hardware code like Verilog constitutes a small portion of training data, and few hardware benchmarks exist. The open-source VerilogEval benchmark, released in November 2023, provided a consistent evaluation framework for LLMs on code completion tasks. Since then, both commercial and open models have seen significant development. In this work, we evaluate commercial and open models released since VerilogEval's original release (including GPT-4o, GPT-4 Turbo, Llama3.1 (8B/70B/405B), Llama3 70B, Mistral Large, DeepSeek Coder (33B and 6.7B), CodeGemma 7B, and RTL-Coder) against an improved VerilogEval benchmark suite. We find measurable improvements in state-of-the-art models: GPT-4o achieves a 63% pass rate on specification-to-RTL tasks. The recently released and open Llama3.1 405B achieves a 58% pass rate, almost matching GPT-4o, while the smaller domain-specific RTL-Coder 6.7B models achieve an impressive 34% pass rate. Additionally, we enhance VerilogEval's infrastructure by automatically classifying failures, introducing in-context learning support, and extending the tasks to specification-to-RTL translation. We find that prompt engineering remains crucial for achieving good pass rates and varies widely with model and task. A benchmark infrastructure that allows for prompt engineering and failure analysis is essential for continued model development and deployment.
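The abstract reports pass rates but does not say how they are computed. As a rough illustration only, the sketch below uses the unbiased pass@k estimator that is common in code-generation benchmarks; whether the reported 63%, 58%, and 34% figures were obtained this way is an assumption, and the function names are hypothetical.

```python
from math import comb
from typing import List, Tuple


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one problem: n samples drawn, c of them passed.

    Returns the probability that at least one of k samples, chosen
    without replacement from the n generated, passes the testbench.
    """
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples exist, so any k-subset contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_rate(results: List[Tuple[int, int]], k: int = 1) -> float:
    """Average pass@k over problems; each entry is (n_samples, n_passed)."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

With k = 1 the estimator reduces to the fraction of samples that pass, so a problem with 3 passing samples out of 10 contributes 0.3 to the average.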
Problem

Research questions and friction points this paper is trying to address.

Evaluate LLMs on hardware code generation
Enhance VerilogEval benchmark infrastructure
Analyze prompt engineering impact on performance
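To make the prompt-engineering point concrete, here is a minimal sketch of how in-context examples could be prepended to a specification-to-RTL task. The prompt layout, the `build_prompt` helper, and the example pair are all assumptions made for illustration; they are not taken from the benchmark.

```python
from typing import List, Tuple

# Hypothetical worked example: (specification, reference Verilog solution).
EXAMPLES: List[Tuple[str, str]] = [
    (
        "Implement a 2-input AND gate.",
        "module top_module(input a, input b, output y);\n"
        "  assign y = a & b;\n"
        "endmodule",
    ),
]


def build_prompt(spec: str, examples: List[Tuple[str, str]], n_shots: int = 1) -> str:
    """Prepend up to n_shots worked examples to the task specification.

    n_shots=0 yields a zero-shot prompt containing only the task itself.
    """
    parts = []
    for example_spec, example_rtl in examples[:n_shots]:
        parts.append(f"Question:\n{example_spec}\n\nAnswer:\n{example_rtl}\n")
    # The final block is left open so the model completes the answer.
    parts.append(f"Question:\n{spec}\n\nAnswer:\n")
    return "\n".join(parts)
```

Sweeping `n_shots` per model is one simple way to observe how sensitive pass rates are to the prompt.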
Innovation

Methods, ideas, or system contributions that make the work stand out.

Enhanced VerilogEval benchmark suite
Automated failure classification system
In-context learning support integration
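One way to picture automated failure classification is a small rule table applied to the compile and simulation log of each failed sample. The sketch below is only that: the category names and the log phrases it matches are hypothetical and are not the benchmark's actual taxonomy or any real tool's exact messages.

```python
import re
from collections import Counter
from typing import Iterable

# Ordered rules: the first pattern that matches decides the label.
# Labels and phrases are illustrative assumptions.
RULES = [
    ("syntax_error", re.compile(r"syntax error", re.IGNORECASE)),
    ("undeclared_identifier", re.compile(r"not declared|undeclared", re.IGNORECASE)),
    ("timeout", re.compile(r"timed out|timeout", re.IGNORECASE)),
    ("functional_mismatch", re.compile(r"mismatch", re.IGNORECASE)),
]


def classify_failure(log: str) -> str:
    """Return the label of the first rule whose pattern appears in the log."""
    for label, pattern in RULES:
        if pattern.search(log):
            return label
    return "unclassified"


def summarize(logs: Iterable[str]) -> Counter:
    """Count failures per category across many logs."""
    return Counter(classify_failure(log) for log in logs)
```

Placing compile-time rules before runtime rules means a sample that fails to compile is not also counted as a functional mismatch.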