EngiBench: A Benchmark for Evaluating Large Language Models on Engineering Problem Solving

📅 2025-09-22
📈 Citations: 0 · Influential citations: 0
🤖 AI Summary
Existing engineering benchmarks fail to capture the real-world uncertainty, context dependency, and open-endedness of engineering practice, motivating an evaluation framework for higher-order capabilities. This paper introduces EngiBench, a hierarchical engineering problem-solving benchmark spanning foundational knowledge retrieval, multi-step contextual reasoning, and open-ended modeling across diverse engineering subfields. Each problem is systematically rewritten into three controlled variants (perturbed, knowledge-enhanced, and mathematically abstracted) to disentangle the assessment of robustness, domain-knowledge mastery, and mathematical reasoning. Built on expert annotation, structured rewriting, and a fine-grained evaluation protocol, EngiBench quantifies model capability along multiple dimensions. Experiments show that state-of-the-art large language models fall well short of human experts on higher-order tasks, lack robustness to small perturbations, and degrade sharply as task complexity grows, exposing fundamental bottlenecks in advanced engineering reasoning.

📝 Abstract
Large language models (LLMs) have shown strong performance on mathematical reasoning under well-posed conditions. However, real-world engineering problems require more than symbolic mathematical computation; they also involve uncertainty, context, and open-ended scenarios. Existing benchmarks fail to capture these complexities. We introduce EngiBench, a hierarchical benchmark designed to evaluate LLMs on solving engineering problems. It spans three levels of increasing difficulty (foundational knowledge retrieval, multi-step contextual reasoning, and open-ended modeling) and covers diverse engineering subfields. To facilitate a deeper understanding of model performance, we systematically rewrite each problem into three controlled variants (perturbed, knowledge-enhanced, and math abstraction), enabling us to separately evaluate the model's robustness, domain-specific knowledge, and mathematical reasoning abilities. Experimental results reveal a clear performance gap across levels: models struggle more as tasks get harder, perform worse when problems are slightly changed, and fall far behind human experts on high-level engineering tasks. These findings show that current LLMs still lack the high-order reasoning needed for real-world engineering, highlighting the need for future models with deeper, more reliable problem-solving capabilities. Our source code and data are available at https://github.com/EngiBench/EngiBench.
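To make the benchmark design concrete, the sketch below shows one way a single EngiBench problem and its three controlled variants might be represented in Python. All field names here are illustrative assumptions, not the repository's actual schema; consult https://github.com/EngiBench/EngiBench for the real data format.

```python
from dataclasses import dataclass

@dataclass
class EngiBenchItem:
    """One engineering problem plus its three controlled variants.

    All field names are hypothetical, not the official schema.
    """
    problem_id: str
    level: int               # 1 = knowledge retrieval, 2 = contextual reasoning, 3 = open-ended modeling
    subfield: str            # e.g. "electrical", "mechanical", "civil"
    original: str            # the original problem statement
    perturbed: str           # surface-level rewrite probing robustness
    knowledge_enhanced: str  # original plus the required domain facts
    math_abstraction: str    # the bare mathematical core, context stripped
    reference_answer: str    # expert-annotated ground truth
```

Grouping the four versions of a problem in one record keeps the controlled-variable comparison simple: any score difference between two fields of the same item can be attributed to the single factor that was changed.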
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLMs on real-world engineering problems with uncertainty and context
Assessing model performance across hierarchical difficulty levels in engineering
Measuring robustness and reasoning gaps between LLMs and human experts
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hierarchical benchmark with three difficulty levels
Systematic problem rewriting into controlled variants
Separate evaluation of robustness, knowledge, and reasoning (see the sketch after this list)
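Building on the hypothetical EngiBenchItem record sketched above, the following is a minimal evaluation loop illustrating how scoring each variant separately isolates the three capabilities. The model and grade callables are assumed interfaces (a text-in/text-out model and a binary grader), not the paper's actual fine-grained protocol.

```python
from collections import defaultdict
from typing import Callable, Iterable

VARIANTS = ("original", "perturbed", "knowledge_enhanced", "math_abstraction")

def evaluate(model: Callable[[str], str],
             grade: Callable[[str, str], bool],
             items: Iterable["EngiBenchItem"]) -> dict:
    """Score a model on every variant of every item, then derive
    per-capability signals from the differences between variants."""
    items = list(items)
    correct = defaultdict(int)
    for item in items:
        for variant in VARIANTS:
            answer = model(getattr(item, variant))  # prompt with that variant's text
            correct[variant] += grade(answer, item.reference_answer)
    acc = {v: correct[v] / len(items) for v in VARIANTS}
    return {
        "accuracy": acc,
        # Drop from original to perturbed isolates (lack of) robustness.
        "robustness_drop": acc["original"] - acc["perturbed"],
        # Gain when domain facts are supplied isolates missing knowledge.
        "knowledge_gain": acc["knowledge_enhanced"] - acc["original"],
        # Score on the bare math isolates mathematical reasoning.
        "math_reasoning": acc["math_abstraction"],
    }
```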
👥 Authors

Xiyuan Zhou
Nanyang Technological University
large language model · carbon market · machine learning

Xinlei Wang
School of Electrical and Information Engineering, University of Sydney

Yirui He
University of California, Irvine
Software Engineering

Yang Wu
School of Science and Engineering, The Chinese University of Hong Kong, Shenzhen

Ruixi Zou
School of Data Science, The Chinese University of Hong Kong, Shenzhen

Yuheng Cheng
The Chinese University of Hong Kong, Shenzhen

Yulu Xie
School of Science and Engineering, The Chinese University of Hong Kong, Shenzhen

Wenxuan Liu
School of Electrical and Electronic Engineering, Nanyang Technological University

Huan Zhao
Department of Building Environment and Energy Engineering, Hong Kong Polytechnic University

Yan Xu
School of Electrical and Electronic Engineering, Nanyang Technological University

Jinjin Gu
INSAIT, Sofia University

Junhua Zhao
School of Science and Engineering, The Chinese University of Hong Kong, Shenzhen