🤖 AI Summary
This work addresses the lack of systematic benchmarks for evaluating how well large language models (LLMs) perform repository-scale bug repair in real-world hardware projects. We introduce HWE-Bench, the first hardware repository-level repair benchmark, comprising 417 authentic historical repair tasks from six open-source projects spanning RISC-V cores, SoCs, and hardware roots of trust, with support for Verilog/SystemVerilog and Chisel. Built on containerized environments, native simulation and regression flows, and a largely automated data pipeline, HWE-Bench enables scalable, multi-project evaluation. Experiments show that the best LLM agent achieves an overall repair success rate of 70.7%, exceeding 90% on smaller cores but dropping below 65% on complex SoCs. Failure analysis points to three key challenges in hardware debugging: fault localization, hardware-semantic reasoning, and cross-artifact coordination. Notably, performance gaps among models are larger than those commonly reported for software repair benchmarks.
📝 Abstract
Existing benchmarks for hardware design primarily evaluate Large Language Models (LLMs) on isolated, component-level tasks such as generating HDL modules from specifications, leaving repository-scale evaluation unaddressed. We introduce HWE-Bench, the first large-scale, repository-level benchmark for evaluating LLM agents on real-world hardware bug repair tasks. HWE-Bench comprises 417 task instances derived from real historical bug-fix pull requests across six major open-source projects spanning both Verilog/SystemVerilog and Chisel, covering RISC-V cores, SoCs, and security roots-of-trust. Each task is grounded in a fully containerized environment where the agent must resolve a real bug report, with correctness validated through the project's native simulation and regression flows. The benchmark is built through a largely automated pipeline that enables efficient expansion to new repositories. We evaluate seven LLMs with four agent frameworks and find that the best agent resolves 70.7% of tasks overall, with performance exceeding 90% on smaller cores but dropping below 65% on complex SoC-level projects. We observe larger performance gaps across models than commonly reported on software benchmarks, and difficulty is driven by project scope and bug-type distribution rather than code size alone. Our failure analysis traces agent failures to three stages of the debugging process: fault localization, hardware-semantic reasoning, and cross-artifact coordination across RTL, configuration, and verification components, providing concrete directions for developing more capable hardware-aware agents.
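As a rough illustration of how an overall resolve rate and per-project resolve rates like those quoted above might be aggregated, the following minimal sketch assumes each evaluated task yields a (project, resolved) pair. The function name `resolve_rates`, the project labels, and the toy records are all hypothetical and are not taken from the benchmark or its tooling.

```python
from collections import defaultdict


def resolve_rates(results):
    """Aggregate (project, resolved) pairs into overall and per-project rates.

    `results` is an iterable of (project_name, resolved_flag) tuples.
    This record shape is an assumption made for illustration only.
    """
    totals = defaultdict(int)   # tasks attempted per project
    passed = defaultdict(int)   # tasks resolved per project
    for project, resolved in results:
        totals[project] += 1
        passed[project] += int(bool(resolved))
    if not totals:
        raise ValueError("no results to aggregate")
    per_project = {p: passed[p] / totals[p] for p in totals}
    overall = sum(passed.values()) / sum(totals.values())
    return overall, per_project


# Hypothetical toy records, NOT data from HWE-Bench.
toy_results = [
    ("small_core", True), ("small_core", True),
    ("small_core", True), ("small_core", False),
    ("big_soc", True), ("big_soc", False),
    ("big_soc", False), ("big_soc", False),
]

overall, per_project = resolve_rates(toy_results)
print(f"overall: {overall:.1%}")
for project, rate in sorted(per_project.items()):
    print(f"{project}: {rate:.1%}")
```

Note that the overall rate here is pooled over all tasks rather than averaged over projects, so projects with more tasks weigh more heavily. Whether the reported 70.7% is computed this way is an assumption of the sketch.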