AECBench: A Hierarchical Benchmark for Knowledge Evaluation of Large Language Models in the AEC Field

📅 2025-09-23
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Large language models (LLMs) lack rigorous, domain-specific evaluation in the safety-critical Architecture, Engineering, and Construction (AEC) field, hindering reliable deployment. Method: We introduce AECBench, a hierarchical benchmark for LLMs in AEC grounded in a cognition-oriented five-level framework (Knowledge Memorization, Understanding, Reasoning, Calculation, Application). Its 23 tasks and 4,800 multi-format questions were derived from authentic AEC practice (e.g., code retrieval, technical document generation), authored by practicing engineers, and validated through a two-round expert review. For complex, long-form responses, an LLM-as-a-Judge approach built on expert-derived rubrics provides scalable and consistent scoring. Results: Evaluation of nine state-of-the-art models shows a clear performance decline across the five levels, with critical deficiencies in interpreting tables in building codes, executing complex reasoning and calculation, and generating domain-specific documents, highlighting reliability bottlenecks for engineering practice.

📝 Abstract
Large language models (LLMs), as a novel information technology, are seeing increasing adoption in the Architecture, Engineering, and Construction (AEC) field. They have shown their potential to streamline processes throughout the building lifecycle. However, the robustness and reliability of LLMs in such a specialized and safety-critical domain remain to be evaluated. To address this challenge, this paper establishes AECBench, a comprehensive benchmark designed to quantify the strengths and limitations of current LLMs in the AEC domain. The benchmark defines 23 representative tasks within a five-level cognition-oriented evaluation framework encompassing Knowledge Memorization, Understanding, Reasoning, Calculation, and Application. These tasks were derived from authentic AEC practice, with scope ranging from code retrieval to specialized document generation. Subsequently, a 4,800-question dataset encompassing diverse formats, including open-ended questions, was crafted primarily by engineers and validated through a two-round expert review. Furthermore, an LLM-as-a-Judge approach was introduced to provide a scalable and consistent methodology for evaluating complex, long-form responses using expert-derived rubrics. Through the evaluation of nine LLMs, a clear performance decline across the five cognitive levels was revealed. Despite demonstrating proficiency in foundational tasks at the Knowledge Memorization and Understanding levels, the models showed significant performance deficits, particularly in interpreting knowledge from tables in building codes, executing complex reasoning and calculation, and generating domain-specific documents. Consequently, this study lays the groundwork for future research and development aimed at the robust and reliable integration of LLMs into safety-critical engineering practices.
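The LLM-as-a-Judge scoring described in the abstract is, in essence, rubric-guided prompting of a grader model. The sketch below illustrates how such a step could be wired up; the criteria, weights, prompt wording, and names (RUBRIC, build_judge_prompt, judge_answer, call_llm) are hypothetical placeholders for illustration and are not taken from AECBench itself.

    # Illustrative sketch of rubric-based LLM-as-a-Judge scoring (not AECBench's code).
    import json
    from typing import Callable, Dict

    # Hypothetical rubric: per-criterion weights summing to 1.0.
    RUBRIC: Dict[str, float] = {
        "accuracy": 0.4,      # factual / code-compliance correctness
        "completeness": 0.3,  # coverage of points an expert answer would include
        "clarity": 0.3,       # organization and professional wording
    }

    def build_judge_prompt(question: str, reference: str, answer: str) -> str:
        """Assemble a grading prompt that asks for per-criterion scores as JSON."""
        criteria = "\n".join(f"- {name}: integer score 0-10" for name in RUBRIC)
        return (
            "You are a senior AEC engineer grading a model's answer.\n\n"
            f"Question:\n{question}\n\n"
            f"Reference notes:\n{reference}\n\n"
            f"Candidate answer:\n{answer}\n\n"
            f"Score each criterion:\n{criteria}\n"
            'Reply with JSON only, e.g. {"accuracy": 7, "completeness": 6, "clarity": 8}.'
        )

    def judge_answer(question: str, reference: str, answer: str,
                     call_llm: Callable[[str], str]) -> float:
        """Query any chat model via call_llm and return a weighted 0-10 score."""
        raw = call_llm(build_judge_prompt(question, reference, answer))
        scores = json.loads(raw)
        return sum(weight * float(scores[name]) for name, weight in RUBRIC.items())

A caller would supply its own model wrapper as call_llm, e.g. score = judge_answer(q, ref, ans, call_llm=my_chat_model); the expert-derived rubrics in the paper would replace the placeholder criteria above.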
Problem

Research questions and friction points this paper is trying to address.

Evaluating the robustness and reliability of LLMs in the safety-critical AEC domain
Quantifying LLM performance across five cognitive levels in the construction field
Assessing LLM capabilities from knowledge memorization to document generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hierarchical benchmark with five cognitive levels
Dataset with 4,800 questions from authentic practice
LLM-as-a-Judge approach for scalable evaluation
Chen Liang
College of Civil Engineering, Tongji University, 1239 Siping Road, Shanghai 200092, China
Zhaoqi Huang
College of Civil Engineering, Tongji University, 1239 Siping Road, Shanghai 200092, China
Haofen Wang
Tongji University
Knowledge Graph, Natural Language Processing, Retrieval Augmented Generation
Fu Chai
College of Civil Engineering, Tongji University, 1239 Siping Road, Shanghai 200092, China
Chunying Yu
College of Civil Engineering, Tongji University, 1239 Siping Road, Shanghai 200092, China
Huanhuan Wei
Arcplus Group East China Architectural Design & Research Institute Co., Ltd., 151 Hankou Road, Shanghai 200002, China
Zhengjie Liu
Arcplus Group East China Architectural Design & Research Institute Co., Ltd., 151 Hankou Road, Shanghai 200002, China
Yanpeng Li
Data Scientist Lead, FedEx Dataworks
data science, natural language processing, biomedical informatics, clinical informatics
Hongjun Wang
Arcplus Group East China Architectural Design & Research Institute Co., Ltd., 151 Hankou Road, Shanghai 200002, China
Ruifeng Luo
Arcplus Group East China Architectural Design & Research Institute Co., Ltd., 151 Hankou Road, Shanghai 200002, China
Xianzhong Zhao
College of Civil Engineering, Tongji University, 1239 Siping Road, Shanghai 200092, China