Can Language Models Go Beyond Coding? Assessing the Capability of Language Models to Build Real-World Systems

πŸ“… 2025-11-01
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
This work addresses the challenge of automatically repairing software build failures during cross-Instruction-Set-Architecture (ISA) migration. To this end, we introduce Build-bench, the first end-to-end evaluation benchmark designed specifically for this scenario. Build-bench integrates architecture-aware reasoning, tool-augmented inference, and executable validation, enabling multi-turn autonomous repair through structure extraction, content modification, build execution, and log-driven feedback. We systematically evaluate six state-of-the-art large language models (LLMs) on 268 real-world build-failing packages; the best-performing model achieves a 63% build repair success rate. Our analysis reveals, for the first time, substantial disparities among models in tool-calling strategies and iterative repair behaviors. This work establishes a reproducible, executable evaluation paradigm and provides an empirical foundation for LLM-driven cross-architecture software migration.

πŸ“ Abstract
Large language models (LLMs) have shown growing potential in software engineering, yet few benchmarks evaluate their ability to repair software during migration across instruction set architectures (ISAs). Cross-ISA migration, such as between x86_64 and aarch64, requires handling complex dependencies, heterogeneous toolchains, and long build logs while ensuring executable verification. To address this challenge, we present Build-bench, an end-to-end benchmark that systematically evaluates the capability of LLMs to repair build failures in cross-ISA settings. Build-bench collects 268 real-world failed packages and integrates auxiliary tools including Structure Extraction, File Content Extraction, Content Modification, and Build Verification to support autonomous, tool-augmented reasoning. The repair process operates in an iterative loop where, upon failure, the model receives updated build logs and previous repair outcomes to refine subsequent attempts. Through a comparative evaluation of six representative LLMs, Build-bench reveals that current models achieve a maximum build success rate of 63% and that tool usage patterns differ significantly across models. By coupling real build environments with verifiable outcomes, Build-bench establishes the first architecture-aware benchmark for studying LLM-based software build and repair.
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLMs' ability to repair cross-ISA software build failures
Assessing autonomous tool-augmented reasoning for build error correction
Establishing architecture-aware benchmarks for software migration capabilities
Innovation

Methods, ideas, or system contributions that make the work stand out.

Build-bench benchmark evaluates LLMs for cross-ISA build repair
Integrates auxiliary tools for autonomous tool-augmented reasoning
Iterative repair loop with updated logs and previous outcomes
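The iterative repair loop described above (build, inspect the log, let the model propose a fix, rebuild) can be sketched as follows. This is a minimal illustration, not the paper's implementation: `propose_fix`, the attempt cap, and the stub tool-call interface are all assumed names for exposition.

```python
def repair_loop(build_fn, model, max_attempts=5):
    """Iterative build-repair sketch (hypothetical interface).

    build_fn: callable returning (success: bool, log: str) -- stands in
              for the benchmark's Build Verification tool.
    model:    object with propose_fix(log, history) returning a callable
              that applies a repair (stands in for the model's tool calls,
              e.g. Structure Extraction or Content Modification).
    """
    history = []  # previous repair outcomes fed back to the model
    for attempt in range(1, max_attempts + 1):
        ok, log = build_fn()
        if ok:
            return True, attempt  # build repaired on this attempt
        # On failure, the model sees the updated build log plus prior
        # outcomes and proposes the next repair action.
        apply_fix = model.propose_fix(log, history)
        apply_fix()
        history.append((attempt, log[-2000:]))  # keep only the log tail
    return False, max_attempts
```

The key design point mirrored here is log-driven feedback: each failed build produces a fresh log that, together with the outcome history, conditions the model's next tool call, rather than asking the model to repair the package in a single shot.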
πŸ”Ž Similar Papers
No similar papers found.