AECBench: A Hierarchical Benchmark for Knowledge Evaluation of Large Language Models in the AEC Field

📅 2025-09-23
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Large language models (LLMs) lack rigorous, domain-specific evaluation in the safety-critical Architecture, Engineering, and Construction (AEC) field, hindering reliable deployment. Method: We introduce AECBench, a hierarchical benchmark for LLMs in AEC grounded in a cognition-oriented five-level framework (Knowledge Memorization, Understanding, Reasoning, Calculation, Application). Its 23 tasks and 4,800 multi-format questions were derived from authentic AEC practice (e.g., code retrieval, technical document generation), authored by practicing engineers, and validated through a two-round expert review. For complex, long-form responses, an LLM-as-a-Judge approach built on expert-derived rubrics provides scalable and consistent scoring. Results: Evaluation of nine state-of-the-art models shows a clear performance decline across the five levels, with critical deficiencies in interpreting tables in building codes, executing complex reasoning and calculation, and generating domain-specific documents, highlighting reliability bottlenecks for engineering practice.

📝 Abstract
Large language models (LLMs), as a novel information technology, are seeing increasing adoption in the Architecture, Engineering, and Construction (AEC) field. They have shown their potential to streamline processes throughout the building lifecycle. However, the robustness and reliability of LLMs in such a specialized and safety-critical domain remain to be evaluated. To address this challenge, this paper establishes AECBench, a comprehensive benchmark designed to quantify the strengths and limitations of current LLMs in the AEC domain. The benchmark defines 23 representative tasks within a five-level cognition-oriented evaluation framework encompassing Knowledge Memorization, Understanding, Reasoning, Calculation, and Application. These tasks were derived from authentic AEC practice, with scope ranging from code retrieval to specialized document generation. Subsequently, a 4,800-question dataset encompassing diverse formats, including open-ended questions, was crafted primarily by engineers and validated through a two-round expert review. Furthermore, an LLM-as-a-Judge approach was introduced to provide a scalable and consistent methodology for evaluating complex, long-form responses using expert-derived rubrics. Through the evaluation of nine LLMs, a clear performance decline across the five cognitive levels was revealed. Despite demonstrating proficiency in foundational tasks at the Knowledge Memorization and Understanding levels, the models showed significant performance deficits, particularly in interpreting knowledge from tables in building codes, executing complex reasoning and calculation, and generating domain-specific documents. Consequently, this study lays the groundwork for future research and development aimed at the robust and reliable integration of LLMs into safety-critical engineering practices.
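The LLM-as-a-Judge scoring described in the abstract is, in essence, rubric-guided prompting of a grader model. The sketch below illustrates how such a step could be wired up; the criteria, weights, prompt wording, and names (RUBRIC, build_judge_prompt, judge_answer, call_llm) are hypothetical placeholders for illustration and are not taken from AECBench itself.

    # Illustrative sketch of rubric-based LLM-as-a-Judge scoring (not AECBench's code).
    import json
    from typing import Callable, Dict

    # Hypothetical rubric: per-criterion weights summing to 1.0.
    RUBRIC: Dict[str, float] = {
        "accuracy": 0.4,      # factual / code-compliance correctness
        "completeness": 0.3,  # coverage of points an expert answer would include
        "clarity": 0.3,       # organization and professional wording
    }

    def build_judge_prompt(question: str, reference: str, answer: str) -> str:
        """Assemble a grading prompt that asks for per-criterion scores as JSON."""
        criteria = "\n".join(f"- {name}: integer score 0-10" for name in RUBRIC)
        return (
            "You are a senior AEC engineer grading a model's answer.\n\n"
            f"Question:\n{question}\n\n"
            f"Reference notes:\n{reference}\n\n"
            f"Candidate answer:\n{answer}\n\n"
            f"Score each criterion:\n{criteria}\n"
            'Reply with JSON only, e.g. {"accuracy": 7, "completeness": 6, "clarity": 8}.'
        )

    def judge_answer(question: str, reference: str, answer: str,
                     call_llm: Callable[[str], str]) -> float:
        """Query any chat model via call_llm and return a weighted 0-10 score."""
        raw = call_llm(build_judge_prompt(question, reference, answer))
        scores = json.loads(raw)
        return sum(weight * float(scores[name]) for name, weight in RUBRIC.items())

A caller would supply its own model wrapper as call_llm, e.g. score = judge_answer(q, ref, ans, call_llm=my_chat_model); the expert-derived rubrics in the paper would replace the placeholder criteria above.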
Problem

Research questions and friction points this paper is trying to address.

Evaluating the robustness and reliability of LLMs in the safety-critical AEC domain
Quantifying LLM performance across five cognitive levels in the construction field
Assessing LLM capabilities from knowledge memorization to document generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hierarchical benchmark with five cognitive levels
Dataset with 4,800 questions from authentic practice
LLM-as-a-Judge approach for scalable evaluation
Chen Liang
College of Civil Engineering, Tongji University, 1239 Siping Road, Shanghai 200092, China
Zhaoqi Huang
College of Civil Engineering, Tongji University, 1239 Siping Road, Shanghai 200092, China
Haofen Wang
Tongji University
Knowledge Graph, Natural Language Processing, Retrieval Augmented Generation
Fu Chai
College of Civil Engineering, Tongji University, 1239 Siping Road, Shanghai 200092, China
Chunying Yu
College of Civil Engineering, Tongji University, 1239 Siping Road, Shanghai 200092, China
Huanhuan Wei
Arcplus Group East China Architectural Design & Research Institute Co., Ltd., 151 Hankou Road, Shanghai 200002, China
Zhengjie Liu
Arcplus Group East China Architectural Design & Research Institute Co., Ltd., 151 Hankou Road, Shanghai 200002, China
Yanpeng Li
Data Scientist Lead, FedEx Dataworks
data science, natural language processing, biomedical informatics, clinical informatics
Hongjun Wang
Arcplus Group East China Architectural Design & Research Institute Co., Ltd., 151 Hankou Road, Shanghai 200002, China
Ruifeng Luo
Arcplus Group East China Architectural Design & Research Institute Co., Ltd., 151 Hankou Road, Shanghai 200002, China
Xianzhong Zhao
College of Civil Engineering, Tongji University, 1239 Siping Road, Shanghai 200092, China