EngiAI: A Multi-Agent Framework and Benchmark Suite for LLM-Driven Engineering Design

📅 2026-05-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing evaluation frameworks do not adequately cover multi-agent engineering design systems that integrate simulation, retrieval, and manufacturing preparation. This work proposes a multidimensional benchmark suite tailored for engineering design and introduces a LangGraph-based multi-agent architecture that orchestrates topology optimization, document retrieval, HPC job scheduling, and 3D printer control within a unified workflow. The system integrates retrieval-augmented generation (RAG), SLURM cluster orchestration, and conditional branching logic. Across four LLM backends, proprietary models achieve an average task completion rate of 96–97% on the Beams2D benchmark, while open-source 4B-parameter models attain 55–78%. With retrieval, gated RAG scores for parameter selection are near-perfect (≈1.0), compared with near-zero without it. Tasks involving conditional branching exhibit the lowest completion rates, dropping to 20–53% on Photonics2D.
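The supervisor-style orchestration with conditional branching described above can be illustrated with a minimal sketch in plain Python. It does not use the LangGraph API and is not the EngiAI implementation: the agent names, the `compliance` and `threshold` fields, and the numeric values are all hypothetical and chosen only to show how a supervisor might route between agents and take a branch only when a condition holds.

```python
# Hypothetical stand-ins for specialized agents; none of these names come from the paper.
def optimize_agent(state):
    # Pretend topology optimization produced this compliance value (made up).
    state["compliance"] = 120.0
    return state


def refine_agent(state):
    # Pretend a refinement pass halves the compliance (made up).
    state["compliance"] *= 0.5
    return state


def print_agent(state):
    # Pretend the design was sent to a 3D printer.
    state["printed"] = True
    return state


AGENTS = {"optimize": optimize_agent, "refine": refine_agent, "print": print_agent}


def supervisor(state):
    """Return the name of the next agent to run, or "END" when nothing is left."""
    done = state["done"]
    if "optimize" not in done:
        return "optimize"
    # Conditional branch: refine only if the result misses the (assumed) threshold.
    if state["compliance"] > state["threshold"] and "refine" not in done:
        return "refine"
    if "print" not in done:
        return "print"
    return "END"


def run_workflow(threshold, max_steps=10):
    """Loop supervisor -> agent until the supervisor returns "END"."""
    state = {"threshold": threshold, "done": []}
    for _ in range(max_steps):  # hard cap so a routing bug cannot loop forever
        next_agent = supervisor(state)
        if next_agent == "END":
            break
        state = AGENTS[next_agent](state)
        state["done"].append(next_agent)
    return state
```

Under these assumptions, `run_workflow(100.0)` takes the refinement branch before printing, while `run_workflow(200.0)` skips it. A benchmark for the conditional style would then check whether the agent trace matches the branch the prompt required.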
📝 Abstract
Large Language Model (LLM) agents are increasingly applied to engineering design tasks, yet existing evaluation frameworks do not adequately address multi-agent systems that combine simulation, retrieval, and manufacturing preparation. We introduce a benchmark suite with three evaluation dimensions: (1) a workflow benchmark with seven prompt styles targeting distinct cognitive demands, including direct tool use, semantic disambiguation, conditional branching, and working-memory tasks; (2) a Retrieval-Augmented Generation (RAG) benchmark with gated scoring isolating retrieval contributions to parameter selection; and (3) a High-Performance Computing (HPC) benchmark evaluating end-to-end ML training orchestration on a SLURM cluster. Alongside the benchmark, we present EngiAI, a Multi-Agent System (MAS) reference implementation built on LangGraph that operationalizes the benchmark by coordinating seven specialized agents through a supervisor architecture, unifying topology optimization, document retrieval, HPC job orchestration, and 3D printer control. Across four LLM backends and two EngiBench problems, proprietary models achieve 96–97% average task completion on Beams2D, while open-source 4B-parameter models reach 55–78%, with clear generational improvement. Conditional branching proves most challenging, with task completion dropping to 20–53% for the conditional style on Photonics2D. RAG gating confirms near-perfect retrieval-augmented scores ($\approx 1.0$) versus near-zero without retrieval, validating the evaluation design. On HPC orchestration, one model completes all pipeline steps in 100% of runs while another drops to 50%, revealing that multi-step instruction following degrades over long-running workflows.
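The abstract does not define gated scoring, so the following is only one plausible reading, sketched under stated assumptions: a run earns parameter-selection credit only if a gate condition passes (here, that the agent actually performed retrieval), and otherwise scores zero. The `Run` type, its fields, and the parameter names `volfrac` and `rmin` are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class Run:
    retrieved: bool   # assumed gate: did the agent invoke the retriever?
    chosen: dict      # parameters the agent selected
    reference: dict   # parameters stated in the source document


def param_accuracy(chosen, reference):
    """Fraction of reference parameters the agent matched exactly."""
    if not reference:
        return 0.0
    hits = sum(1 for key, value in reference.items() if chosen.get(key) == value)
    return hits / len(reference)


def gated_score(run):
    """Credit parameter accuracy only when the gate passes; otherwise score 0."""
    if not run.retrieved:
        return 0.0
    return param_accuracy(run.chosen, run.reference)


# Illustrative values only.
reference = {"volfrac": 0.35, "rmin": 2.0}
with_rag = Run(retrieved=True, chosen={"volfrac": 0.35, "rmin": 2.0}, reference=reference)
without_rag = Run(retrieved=False, chosen={"volfrac": 0.35, "rmin": 2.0}, reference=reference)
partial = Run(retrieved=True, chosen={"volfrac": 0.35, "rmin": 1.5}, reference=reference)
```

The point of a gate of this kind is that a model cannot score well by guessing plausible defaults: in this sketch, correct parameters without retrieval still score zero, which is consistent in spirit with the reported contrast between scores of about 1.0 with retrieval and near-zero without it.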
Problem

Research questions and friction points this paper is trying to address.

multi-agent system
engineering design
evaluation framework
LLM agents
benchmark suite
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-Agent System
Retrieval-Augmented Generation
High Performance Computing
Engineering Design Automation
LLM Benchmarking