🤖 AI Summary
This work examines the performance gap that AI agents show between software engineering and real-world hardware engineering. Existing hardware benchmarks do not jointly evaluate repository navigation, hierarchy-aware localization, EDA-executable verification, and maintenance-style repair. To bridge this gap, we introduce Phoenix-bench, the first end-to-end benchmark for real hardware development. It comprises 511 Verilator-validated hardware instances drawn from 114 GitHub repositories, packaged as a synchronized corpus with Dockerized EDA environments, design-flow labels, and fail-to-pass/pass-to-pass test harnesses. Experiments show that agent performance drops by 37% to 58% from SWE-bench to Phoenix-bench. A single round of test feedback raises repair rates by 42% to 45%, whereas perfect file-level localization yields only a marginal 1.4% improvement. These findings underscore fundamental differences between software and hardware engineering and highlight the critical role of test feedback in hardware repair tasks.
📝 Abstract
We ask whether agentic AI systems built for software engineering transfer to realistic hardware engineering. Existing hardware LLM benchmarks isolate sub-tasks; none jointly requires repository navigation, hierarchy-aware localization, executable verification with Electronic Design Automation (EDA) tools, and maintenance-style patching. We introduce \textbf{Phoenix-bench}, a synchronized corpus of 511 Verilator-verified instances from 114 GitHub repositories. Each instance ships with the developer patch, design-flow labels, fail-to-pass and pass-to-pass testbenches, and a Docker-pinned EDA environment, so that resolved-rate differences reflect agent behavior rather than toolchain availability. Using Phoenix-bench, we run a uniform evaluation of four commercial agents and eight open-source agentic structures across four LLM backbones, plus two diagnostic interventions (file-level oracle localization and one round of testbench-log feedback). Three findings emerge. (i)~Software and hardware are fundamentally different engineering tasks: the same agent loses 37\% to 58\% from SWE-bench Verified to Phoenix-bench, because hardware bugs propagate across parallel instantiated modules through signal flow rather than along a software-style call graph, and software-tuned agents stop at the symptom file instead of tracing back through the instantiation chain. (ii)~Failures concentrate on design control-flow / finite state machine (FSM) bugs, verification testbench bugs, and hard cases that demand cross-hierarchy signal-flow tracking and coordinated multi-file edits. (iii)~Localization granularity matters far more than localization itself: a perfect file-level oracle yields only $+1.4$\%, because the agent then breaks files that did not need editing, while a single round of testbench-log feedback lifts the resolved rate by $42$\% to $45$\%, because the log tells the agent \emph{where} the bug is and \emph{what} the fix has to look like.
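
The fail-to-pass and pass-to-pass testbenches mentioned above suggest a resolved-rate computation along the lines of the sketch below. It assumes the SWE-bench convention: an instance counts as resolved only if every fail-to-pass testbench passes after the agent's patch and every pass-to-pass testbench still passes. The abstract does not spell out Phoenix-bench's exact criterion, and the `InstanceResult` schema, the function names, and the testbench names are hypothetical, chosen only for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class InstanceResult:
    """Post-patch testbench outcomes for one instance (hypothetical schema)."""
    instance_id: str
    # Testbench name -> passed after the agent's patch?
    fail_to_pass: dict[str, bool] = field(default_factory=dict)
    pass_to_pass: dict[str, bool] = field(default_factory=dict)


def is_resolved(result: InstanceResult) -> bool:
    """Assumed SWE-bench-style criterion, not confirmed by the abstract."""
    if not result.fail_to_pass:
        # Without a fail-to-pass testbench there is no evidence of a fix.
        return False
    bug_fixed = all(result.fail_to_pass.values())
    nothing_broken = all(result.pass_to_pass.values())
    return bug_fixed and nothing_broken


def resolved_rate(results: list[InstanceResult]) -> float:
    """Fraction of instances that are resolved; 0.0 for an empty list."""
    if not results:
        return 0.0
    return sum(is_resolved(r) for r in results) / len(results)


if __name__ == "__main__":
    demo = [
        InstanceResult("demo-1", {"tb_fsm": True}, {"tb_smoke": True}),
        # Fixes the bug but breaks a previously passing testbench.
        InstanceResult("demo-2", {"tb_fsm": True}, {"tb_smoke": False}),
    ]
    print(f"resolved rate: {resolved_rate(demo):.2f}")
```

Under this criterion, a patch that fixes the target bug but breaks a previously passing testbench is not counted as resolved, which is consistent with finding (iii), where the oracle-localized agent "breaks files that did not need editing".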