🤖 AI Summary
This work addresses a bottleneck in industrial asset maintenance: translating symbolic monitoring rules into repair actions depends heavily on expert experience. To enable systematic evaluation, the authors introduce a benchmark of 6,690 expert-validated multiple-choice questions, drawn from 118 rule-action pairs across 16 asset types, together with a standardized pipeline and five variants that probe failure modes of large language models (LLMs) under structural perturbation. The methodology combines normalization of rules to disjunctive normal form, embedding-based distractor sampling, multiple-choice question generation, and Bradley-Terry Elo ranking, complemented by a human evaluation. An assessment of 29 LLMs and four embedding baselines shows that, despite near-parity among the top-performing models, every model loses accuracy under distractor expansion, and frontier models often fall back on pattern matching under condition inversion. Human practitioners average only 45.0% accuracy, which underscores the difficulty of the task; the authors locate the deployment bottleneck in model calibration rather than capability.
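The Elo-based ranking mentioned above can be illustrated with a minimal Bradley-Terry fit. This is a sketch under stated assumptions, not the authors' implementation: the input format (`wins[(a, b)]` counting how often model `a` beat model `b`), the way pairwise outcomes would be derived from multiple-choice answers, the base rating of 1000, and the 400-point logistic scale are all hypothetical choices made for illustration.

```python
import math


def bradley_terry_elo(wins, iters=200, base=1000.0):
    """Fit Bradley-Terry strengths with the classic MM update and
    report them on an Elo-like scale.

    wins[(a, b)] is the number of comparisons model a won against
    model b (hypothetical input format, not taken from the paper).
    """
    models = sorted({m for pair in wins for m in pair})
    strength = {m: 1.0 for m in models}
    eps = 1e-9  # keeps models with zero wins away from log(0)

    for _ in range(iters):
        updated = {}
        for i in models:
            total_wins = sum(c for (a, _), c in wins.items() if a == i)
            denom = 0.0
            for j in models:
                if j == i:
                    continue
                games = wins.get((i, j), 0) + wins.get((j, i), 0)
                if games:
                    denom += games / (strength[i] + strength[j])
            updated[i] = max(total_wins, eps) / denom if denom > 0 else strength[i]
        # Strengths are only defined up to scale: fix the geometric mean at 1.
        log_mean = sum(math.log(v) for v in updated.values()) / len(updated)
        strength = {m: v / math.exp(log_mean) for m, v in updated.items()}

    # Map strengths to an Elo-like scale (400 points per factor of 10).
    return {m: base + 400.0 * math.log10(s) for m, s in strength.items()}


if __name__ == "__main__":
    toy = {("model_a", "model_b"): 3, ("model_b", "model_a"): 1}
    print(bradley_terry_elo(toy))
```

On this scale, a rating gap corresponds to an expected win probability, so differences between models are easier to interpret than raw accuracy gaps when the top models are close together.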
📝 Abstract
Monitoring complex industrial assets relies on engineer-authored symbolic rules that trigger based on sensor conditions and prompt technicians to perform corrective actions. The bottleneck is not detection but response: translating rules into maintenance steps requires asset-specific knowledge gained through years of practice. We investigate whether LLMs can serve as decision support for this rule-to-action step and introduce \ours{}, a benchmark of 6{,}690 expert-validated multiple-choice questions from 118 rule-action pairs across 16 asset types. We contribute (i) a symbolic-to-MCQA pipeline normalizing rules to Disjunctive Normal Form with embedding-based distractor sampling, (ii) five variants probing distinct failure modes (Pro, Pert, Verbose, Aug, Rationale), and (iii) a benchmark of 29 LLMs and 4 embedding baselines. A human evaluation (9 practitioners, mean 45.0\%) confirms \ours{} requires specialist knowledge beyond operational experience. Three findings stand out. The frontier has closed: the top three LLMs lie within one Macro point, with Bradley-Terry Elo placing claude-opus-4-6 30 points above the next model. Yet \ours{}\,Pro exposes brittleness, with every model losing 13--60\% relative accuracy under distractor expansion. \ours{}\,Aug exposes pattern-matching: under condition inversion, frontier models still select the original answer 49--63\% of the time. The deployment bottleneck is not capability but calibration: frontier models handle template-style fault detection but break under structural perturbation.
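The first contribution, normalizing rules to Disjunctive Normal Form and sampling distractors by embedding similarity, can be sketched as follows. Everything specific here is an assumption made for illustration: the nested-tuple rule encoding, the sensor and action names, the toy two-dimensional vectors standing in for real embeddings, and the choice of picking the nearest neighbours as distractors (the abstract does not state which sampling strategy is used).

```python
import math
from itertools import product


def to_nnf(expr, negate=False):
    """Push negations down to atoms (negation normal form) via De Morgan."""
    if isinstance(expr, str):
        return ("not", expr) if negate else expr
    op = expr[0]
    if op == "not":
        return to_nnf(expr[1], not negate)
    if negate:
        op = "or" if op == "and" else "and"
    return (op, *[to_nnf(arg, negate) for arg in expr[1:]])


def _dnf(expr):
    # A literal is an atom or a negated atom.
    if isinstance(expr, str) or expr[0] == "not":
        return [frozenset([expr])]
    parts = [_dnf(arg) for arg in expr[1:]]
    if expr[0] == "or":
        return [clause for part in parts for clause in part]
    # "and": distribute over the disjunctions of each operand.
    return [frozenset().union(*combo) for combo in product(*parts)]


def to_dnf(expr):
    """Return the rule as a list of conjunctions (frozensets of literals)."""
    return _dnf(to_nnf(expr))


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def sample_distractors(correct, embeddings, k=3):
    """Pick the k actions most similar to the correct one (assumed strategy)."""
    target = embeddings[correct]
    scored = [
        (cosine(target, vec), action)
        for action, vec in embeddings.items()
        if action != correct
    ]
    scored.sort(reverse=True)
    return [action for _, action in scored[:k]]


if __name__ == "__main__":
    rule = ("and", "temp_high", ("or", "vib_high", ("not", "flow_ok")))
    print(to_dnf(rule))
    toy_embeddings = {
        "replace bearing": [1.0, 0.0],
        "lubricate bearing": [0.9, 0.1],
        "clean filter": [0.0, 1.0],
        "recalibrate sensor": [0.5, 0.5],
    }
    print(sample_distractors("replace bearing", toy_embeddings, k=2))
```

Each conjunction in the DNF output is one self-contained triggering scenario, which is a natural unit for building a question stem; similar-but-wrong actions then serve as answer options that cannot be eliminated on surface cues alone.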