🤖 AI Summary
Existing engineering benchmarks inadequately capture real-world uncertainty, context dependency, and open-endedness, motivating a framework for evaluating higher-order capabilities. This paper introduces EngiBench, the first hierarchical benchmark for engineering problem solving, spanning foundational knowledge retrieval, multi-step contextual reasoning, and open-ended modeling across diverse engineering subdomains. It further rewrites each problem into three controlled variants (perturbation, knowledge augmentation, and mathematical abstraction) to disentangle the assessment of model robustness, domain knowledge mastery, and mathematical reasoning ability. Leveraging expert annotation, structured rewriting, and a fine-grained evaluation protocol, EngiBench enables multidimensional quantification of model capabilities. Experimental results show that state-of-the-art large language models significantly underperform humans on higher-order tasks, exhibit poor robustness, and degrade sharply as complexity increases, revealing fundamental bottlenecks in advanced engineering reasoning.
📝 Abstract
Large language models (LLMs) have shown strong performance on mathematical reasoning under well-posed conditions. However, real-world engineering problems require more than symbolic mathematical computation: they must handle uncertainty, context, and open-ended scenarios. Existing benchmarks fail to capture these complexities. We introduce EngiBench, a hierarchical benchmark designed to evaluate LLMs on solving engineering problems. It spans three levels of increasing difficulty (foundational knowledge retrieval, multi-step contextual reasoning, and open-ended modeling) and covers diverse engineering subfields. To facilitate a deeper understanding of model performance, we systematically rewrite each problem into three controlled variants (perturbed, knowledge-enhanced, and math abstraction), enabling us to separately evaluate a model's robustness, domain-specific knowledge, and mathematical reasoning abilities. Experimental results reveal a clear performance gap across levels: models struggle more as tasks get harder, perform worse when problems are slightly changed, and fall far behind human experts on high-level engineering tasks. These findings show that current LLMs still lack the high-level reasoning needed for real-world engineering, highlighting the need for future models with deeper and more reliable problem-solving capabilities. Our source code and data are available at https://github.com/EngiBench/EngiBench.
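The three-level, three-variant design described above can be pictured as a simple data structure. The sketch below is purely illustrative: the class name, field names, and example strings are assumptions for exposition, not EngiBench's actual schema, which is defined in the linked repository.

```python
from dataclasses import dataclass

@dataclass
class BenchmarkItem:
    # Hypothetical representation of one EngiBench problem; all names
    # are illustrative assumptions, not the benchmark's real schema.
    level: int                # 1 = knowledge retrieval, 2 = contextual reasoning, 3 = open-ended modeling
    subfield: str             # e.g. "mechanical", "electrical"
    original: str             # the original problem statement
    perturbed: str            # surface rewrite probing robustness
    knowledge_enhanced: str   # variant supplying domain facts, isolating reasoning
    math_abstraction: str     # variant reduced to its mathematical core

    def variants(self) -> dict:
        """Map each controlled variant to the ability it isolates."""
        return {
            "robustness": self.perturbed,
            "domain_knowledge": self.knowledge_enhanced,
            "math_reasoning": self.math_abstraction,
        }

item = BenchmarkItem(
    level=2,
    subfield="mechanical",
    original="A simply supported beam of span L carries a uniform load w...",
    perturbed="A girder spanning L metres supports a distributed load w...",
    knowledge_enhanced="[Given: max moment of a simply supported beam is wL^2/8] A beam...",
    math_abstraction="Maximize f(x) = w*x*(L - x)/2 for 0 <= x <= L.",
)
print(sorted(item.variants().keys()))
```

Evaluating a model on each of the three variants separately, rather than only on `original`, is what lets the benchmark attribute a failure to robustness, missing domain knowledge, or weak mathematical reasoning.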