PHMForge: A Scenario-Driven Agentic Benchmark for Industrial Asset Lifecycle Maintenance

📅 2026-04-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing benchmarks for large language model agents fall short in rigorously evaluating the safety and reliability demands of industrial asset lifecycle maintenance. This work proposes the first scenario-driven agent evaluation benchmark tailored for Prognostics and Health Management (PHM), integrating 75 expert-designed scenarios, 65 domain-specific tools, and multiple industrial asset types. It uniquely combines multidimensional PHM tasks with tool interaction via MCP servers and introduces an execution-level evaluation framework with task-adapted quantitative metrics. End-to-end evaluations on state-of-the-art models—including GPT-4o and Claude Sonnet 4.0—reveal a maximum task completion rate of only 68%, highlighting critical bottlenecks in tool orchestration, multi-asset reasoning, and cross-device generalization. The benchmark is publicly released to advance research in industrial AI agents.
📝 Abstract
Large language model (LLM) agents are increasingly deployed for complex tool-orchestration tasks, yet existing benchmarks fail to capture the rigorous demands of industrial domains where incorrect decisions carry significant safety and financial consequences. To address this critical gap, we introduce PHMForge, the first comprehensive benchmark specifically designed to evaluate LLM agents on Prognostics and Health Management (PHM) tasks through realistic interactions with domain-specific MCP servers. Our benchmark encompasses 75 expert-curated scenarios spanning 7 industrial asset classes (turbofan engines, bearings, electric motors, gearboxes, aero-engines) across 5 core task categories: Remaining Useful Life (RUL) Prediction, Fault Classification, Engine Health Analysis, Cost-Benefit Analysis, and Safety/Policy Evaluation. To enable rigorous evaluation, we construct 65 specialized tools across two MCP servers and implement execution-based evaluators with task-commensurate metrics: MAE/RMSE for regression, F1-score for classification, and categorical matching for health assessments. Through extensive evaluation of leading frameworks (ReAct, Cursor Agent, Claude Code) paired with frontier LLMs (Claude Sonnet 4.0, GPT-4o, Granite-3.0-8B), we find that even top-performing configurations achieve only 68% task completion, with systematic failures in tool orchestration (23% incorrect sequencing), multi-asset reasoning (14.9 percentage point degradation), and cross-equipment generalization (42.7% on held-out datasets). We open-source our complete benchmark, including scenario specifications, ground truth templates, tool implementations, and evaluation scripts, to catalyze research in agentic industrial AI.
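The abstract pairs each task category with a commensurate execution-level metric: MAE/RMSE for RUL regression, F1-score for fault classification, and categorical matching for health assessments. Below is a minimal Python sketch of that scoring logic; the function names and the EvalResult structure are illustrative assumptions, not the released PHMForge evaluation scripts.

```python
# Illustrative sketch of task-adapted evaluators in the spirit of PHMForge.
# Names (evaluate_rul, evaluate_fault_classification, evaluate_health_assessment,
# EvalResult) are hypothetical and do not come from the benchmark's codebase.
from dataclasses import dataclass
from typing import Sequence

import numpy as np
from sklearn.metrics import f1_score


@dataclass
class EvalResult:
    metric: str
    value: float


def evaluate_rul(pred: Sequence[float], truth: Sequence[float]) -> list[EvalResult]:
    """RUL prediction scored as regression: MAE and RMSE against ground truth."""
    p, t = np.asarray(pred, dtype=float), np.asarray(truth, dtype=float)
    mae = float(np.mean(np.abs(p - t)))
    rmse = float(np.sqrt(np.mean((p - t) ** 2)))
    return [EvalResult("MAE", mae), EvalResult("RMSE", rmse)]


def evaluate_fault_classification(pred: Sequence[str], truth: Sequence[str]) -> EvalResult:
    """Fault classification scored with macro-averaged F1 over fault labels."""
    return EvalResult("F1", float(f1_score(truth, pred, average="macro")))


def evaluate_health_assessment(pred: str, truth: str) -> EvalResult:
    """Health assessment scored by exact categorical match on normalized labels."""
    return EvalResult("match", float(pred.strip().lower() == truth.strip().lower()))
```

In this sketch, each scenario's ground-truth template would supply `truth` and the agent's executed tool output would supply `pred`; macro-averaging for the F1-score is an assumption, since the abstract does not specify an averaging scheme.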
Problem

Research questions and friction points this paper is trying to address.

Prognostics and Health Management, LLM agents, industrial asset maintenance, benchmark, tool orchestration
Innovation

Methods, ideas, or system contributions that make the work stand out.

PHMForge, LLM agents, Prognostics and Health Management, MCP servers, scenario-driven benchmark