VeriContaminated: Assessing LLM-Driven Verilog Coding for Data Contamination

📅 2025-03-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work presents the first systematic assessment of data contamination in large language models (LLMs) for Verilog code generation, focusing on the VerilogEval and RTLLM benchmarks and exposing severe pretraining/finetuning data leakage that compromises evaluation validity. Methodologically, it adapts the CCD (Code Contamination Detection) and Min-K% Probability contamination detection techniques—previously developed for general-purpose programming languages—to the domain of hardware description languages, and introduces a formal trade-off framework between contamination mitigation and generation quality/evaluation fairness. Through benchmark-driven, cross-model analysis (LLaMA, GPT, DeepSeek-Coder, Mistral, etc.), the study empirically confirms pervasive Verilog data contamination across mainstream LLMs. Quantitative contamination measurements further demonstrate that mitigation strategies improve evaluation fairness but incur measurable degradation in functional correctness and syntactic quality. This work establishes both methodological foundations and empirical evidence for reliable LLM evaluation in hardware design automation.

📝 Abstract
Large Language Models (LLMs) have revolutionized code generation, achieving exceptional results on various established benchmarking frameworks. However, concerns about data contamination - where benchmark data inadvertently leaks into pre-training or fine-tuning datasets - raise questions about the validity of these evaluations. While this issue is known, limiting the industrial adoption of LLM-driven software engineering, hardware coding has received little to no attention regarding these risks. For the first time, we analyze state-of-the-art (SOTA) evaluation frameworks for Verilog code generation (VerilogEval and RTLLM), using established methods for contamination detection (CCD and Min-K% Prob). We cover SOTA commercial and open-source LLMs (CodeGen2.5, Minitron 4b, Mistral 7b, phi-4 mini, LLaMA-{1,2,3.1}, GPT-{2,3.5,4o}, Deepseek-Coder, and CodeQwen 1.5), in baseline and fine-tuned models (RTLCoder and Verigen). Our study confirms that data contamination is a critical concern. We explore mitigations and the resulting trade-offs for code quality vs fairness (i.e., reducing contamination toward unbiased benchmarking).
Problem

Research questions and friction points this paper is trying to address.

Assessing data contamination in LLM-driven Verilog coding
Evaluating validity of benchmarks in hardware code generation
Exploring trade-offs between code quality and fairness
Innovation

Methods, ideas, or system contributions that make the work stand out.

First systematic contamination analysis of Verilog code generation benchmarks (VerilogEval and RTLLM)
Adapts the CCD and Min-K% Prob contamination detection techniques to hardware description languages
Quantifies the trade-off between contamination mitigation (evaluation fairness) and generated code quality
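The Min-K% Prob detector named above can be sketched in a few lines: score a benchmark sample by the mean log-probability of its k% least likely tokens under the model; an unusually high (less negative) score suggests the sample appeared in the training data. This is a minimal illustrative sketch, not the paper's implementation — the function name, the choice of k, and the log-probability values below are hypothetical stand-ins for what a real model would produce.

```python
def min_k_prob_score(token_logprobs, k=0.2):
    """Mean log-probability of the k% least likely tokens.

    A higher (less negative) score indicates the sequence was more
    predictable to the model, hinting at training-set membership.
    Thresholds must be calibrated per model; none is fixed here.
    """
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    n = max(1, int(len(token_logprobs) * k))  # how many low-prob tokens to keep
    lowest = sorted(token_logprobs)[:n]       # the k% least likely tokens
    return sum(lowest) / n

# Hypothetical per-token log-probs for a 10-token Verilog snippet
logprobs = [-0.1, -0.3, -2.5, -0.2, -4.1, -0.5, -0.9, -3.3, -0.4, -0.6]
score = min_k_prob_score(logprobs, k=0.2)  # averages the 2 lowest values
```

In practice the token log-probabilities would come from the LLM under test (e.g., via its logits over the benchmark prompt-plus-solution text), and the membership threshold is tuned on held-out clean and contaminated samples.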