AI Summary
This work addresses the challenge of automatically repairing software build failures during cross-Instruction-Set-Architecture (ISA) migration. To this end, we introduce Build-bench, the first end-to-end evaluation benchmark designed specifically for this scenario. Build-bench integrates architecture-aware reasoning, tool-augmented inference, and executable validation, enabling multi-turn autonomous repair through structure extraction, content modification, build execution, and log-driven feedback. We systematically evaluate six state-of-the-art large language models (LLMs) on 268 real-world build-failing packages; the best-performing model achieves a 63% build repair success rate. Our analysis also reveals substantial disparities among models in tool-calling strategies and iterative repair behavior. Build-bench thus establishes a reproducible, executable evaluation paradigm and provides an empirical foundation for LLM-driven cross-architecture software migration.
Abstract
Large language models (LLMs) have shown growing potential in software engineering, yet few benchmarks evaluate their ability to repair software during migration across instruction set architectures (ISAs). Cross-ISA migration, such as between x86_64 and aarch64, requires handling complex dependencies, heterogeneous toolchains, and long build logs while ensuring executable verification. To address this challenge, we present Build-bench, an end-to-end benchmark that systematically evaluates the capability of LLMs to repair build failures in cross-ISA settings. Build-bench collects 268 real-world failed packages and integrates auxiliary tools, including Structure Extraction, File Content Extraction, Content Modification, and Build Verification, to support autonomous, tool-augmented reasoning. The repair process operates in an iterative loop: upon failure, the model receives the updated build log and the outcome of its previous repair attempt, which it uses to refine subsequent attempts. Through a comparative evaluation of six representative LLMs, Build-bench reveals that current models achieve a maximum build success rate of 63% and that tool usage patterns differ significantly across models. By coupling real build environments with verifiable outcomes, Build-bench establishes the first architecture-aware benchmark for studying LLM-based software building and repair.
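The iterative repair loop described above can be sketched in a few lines of Python. This is a minimal, hypothetical illustration, not the benchmark's actual implementation: the package representation, the stub build step, and the stand-in "model" (which simply rewrites a known x86-only compiler flag into an aarch64-compatible one) are all assumptions made for the sake of the example.

```python
# Hypothetical sketch of Build-bench's repair loop: attempt a build, and on
# failure feed the build log back to the model, apply its Content Modification,
# and re-run Build Verification until success or the turn budget is exhausted.
from dataclasses import dataclass


@dataclass
class BuildResult:
    success: bool
    log: str


def run_build(package: dict) -> BuildResult:
    # Stub Build Verification: fails while the x86-only flag is present.
    if "-march=x86-64" in package["makefile"]:
        return BuildResult(False, "error: '-march=x86-64' is not valid on aarch64")
    return BuildResult(True, "build succeeded")


def model_propose_fix(log: str, makefile: str) -> str:
    # Stand-in for the LLM: in reality it would reason over the build log and
    # the extracted file contents; here it applies one known flag substitution.
    return makefile.replace("-march=x86-64", "-mcpu=native")


def repair_loop(package: dict, max_turns: int = 3) -> BuildResult:
    result = run_build(package)
    for _ in range(max_turns):
        if result.success:
            break
        # Content Modification guided by the latest build log (log-driven feedback).
        package["makefile"] = model_propose_fix(result.log, package["makefile"])
        result = run_build(package)
    return result


pkg = {"makefile": "CFLAGS = -O2 -march=x86-64"}
print(repair_loop(pkg).success)  # True after one repair turn
```

In the real benchmark the loop is driven by tool calls against an actual build environment; the point of the sketch is only the control flow, in which each failed build produces a fresh log that conditions the next repair attempt.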