FailureSensorIQ: A Multi-Choice QA Dataset for Understanding Sensor Relationships and Failure Modes

📅 2025-06-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the limited domain-reasoning capabilities of large language models (LLMs) in Industry 4.0, particularly around fault patterns, sensor-data interpretation, and inter-device relational modeling, this paper introduces FailureSensorIQ, the first expert-annotated multiple-choice QA benchmark grounded in ISO industrial standards. We propose the Perturbation-Uncertainty-Complexity (PUC) analytical framework to systematically evaluate LLMs along three critical dimensions: perturbation robustness, uncertainty modeling, and logical complexity. To enhance reasoning, we design a ReAct-based agent that dynamically retrieves external domain knowledge, and we release LLMFeatureSelector, a modular, integrable feature-selection toolchain. Experiments reveal substantial industrial knowledge gaps and sensitivity to input perturbations among mainstream LLMs; while GPT-4 approaches expert-level performance, its generalization remains constrained. We open-source the benchmark, evaluation platform, and toolchain, establishing a new paradigm for trustworthy, standardized assessment of industrial LLMs.
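As a concrete illustration of the PUC perturbation dimension, the sketch below shuffles the answer options of an MCQA item and measures how often a model still lands on the originally correct choice. This is a minimal toy, not the paper's evaluation harness; `ask_model` is a hypothetical stand-in (here a trivial placeholder) for a real LLM call.

```python
import random

def ask_model(question: str, options: list[str]) -> int:
    # Placeholder "model" that deterministically picks the longest
    # option; swap in a real LLM API call to test an actual model.
    return max(range(len(options)), key=lambda i: len(options[i]))

def option_shuffle_consistency(question, options, correct_idx,
                               n_shuffles=10, seed=0):
    # Fraction of random option orderings in which the model's pick
    # maps back to the original correct option: a toy order-robustness
    # score in the spirit of PUC's perturbation axis.
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_shuffles):
        order = list(range(len(options)))
        rng.shuffle(order)
        pick = ask_model(question, [options[i] for i in order])
        hits += int(order[pick] == correct_idx)
    return hits / n_shuffles

score = option_shuffle_consistency(
    "Which sensor best indicates bearing wear in a centrifugal pump?",
    ["radial vibration", "ambient humidity", "badge reader events"],
    correct_idx=0,
)
```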

📝 Abstract
We introduce FailureSensorIQ, a novel Multi-Choice Question-Answering (MCQA) benchmarking system designed to assess the ability of Large Language Models (LLMs) to reason about and understand complex, domain-specific scenarios in Industry 4.0. Unlike traditional QA benchmarks, our system focuses on multiple aspects of reasoning through failure modes, sensor data, and the relationships between them across various industrial assets. Through this work, we envision a paradigm shift where modeling decisions are not only data-driven, using statistical tools like correlation analysis and significance tests, but also domain-driven, by specialized LLMs that can reason about the key contributors and useful patterns that can be captured with feature engineering. We evaluate the industrial knowledge of over a dozen LLMs, including GPT-4, Llama, and Mistral, on FailureSensorIQ through different lenses: Perturbation-Uncertainty-Complexity analysis, an expert evaluation study, asset-specific knowledge-gap analysis, and a ReAct agent using external knowledge bases. Even though closed-source models with strong reasoning capabilities approach expert-level performance, the comprehensive benchmark reveals a significant drop in performance under perturbations and distractions, along with inherent knowledge gaps in the models. We also provide a real-world case study of how LLMs can drive modeling decisions on three failure-prediction datasets covering various assets. We release: (a) expert-curated MCQA for various industrial assets, (b) the FailureSensorIQ benchmark and a Hugging Face leaderboard based on MCQA built from non-textual data found in ISO documents, and (c) LLMFeatureSelector, an LLM-based feature-selection scikit-learn pipeline. The software is available at https://github.com/IBM/FailureSensorIQ.
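The released LLMFeatureSelector lives in the repository linked above; the snippet below is only a minimal sketch of how an LLM-driven selector can slot into a scikit-learn `Pipeline` as a transformer. The constructor arguments and the `_placeholder_llm` helper are assumptions for illustration, not the package's actual API.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

def _placeholder_llm(failure_mode, feature_names):
    # Stand-in for a real LLM call: pretend the model judged vibration-
    # and temperature-type sensors relevant to the failure mode.
    return [f for f in feature_names if "vib" in f or "temp" in f]

class LLMFeatureSelector(BaseEstimator, TransformerMixin):
    """Sketch of an LLM-driven, scikit-learn-compatible selector
    (hypothetical API). fit() asks an LLM which sensor features matter
    for a failure mode; transform() keeps only those columns."""

    def __init__(self, failure_mode, feature_names, llm_fn=_placeholder_llm):
        self.failure_mode = failure_mode
        self.feature_names = feature_names
        self.llm_fn = llm_fn

    def fit(self, X, y=None):
        relevant = set(self.llm_fn(self.failure_mode, self.feature_names))
        self.support_ = [i for i, f in enumerate(self.feature_names)
                         if f in relevant]
        return self

    def transform(self, X):
        return np.asarray(X)[:, self.support_]

pipe = Pipeline([
    ("select", LLMFeatureSelector("bearing wear",
                                  ["vib_rms", "temp_out", "flow_rate"])),
    ("clf", LogisticRegression()),
])
X = np.random.default_rng(0).random((40, 3))
y = (X[:, 0] > 0.5).astype(int)
pipe.fit(X, y)  # the classifier sees only vib_rms and temp_out
```

The design point mirrors the abstract's "domain-driven" shift: the fit step consults a reasoning model about sensor-failure relevance rather than a purely statistical criterion such as correlation.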
Problem

Research questions and friction points this paper is trying to address.

Assessing LLMs' reasoning in Industry 4.0 sensor-failure scenarios
Evaluating industrial knowledge gaps in LLMs under perturbations
Enabling domain-driven modeling decisions via LLM-based feature selection
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-Choice QA for sensor failure reasoning (an illustrative item shape follows this list)
Perturbation-Uncertainty-Complexity analysis for LLMs
LLM-based feature selection pipeline
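
For a sense of what the first bullet means in practice, here is an invented illustration of an item's shape; this is not an actual FailureSensorIQ entry.

```python
# Invented illustration of an MCQA item's shape; not an actual
# FailureSensorIQ entry.
item = {
    "asset": "centrifugal pump",
    "failure_mode": "impeller wear",
    "question": "Which sensor is most indicative of impeller wear?",
    "options": ["vibration (radial)", "ambient humidity",
                "cabinet door switch", "badge reader events"],
    "answer": 0,  # index of the expert-chosen option
}
```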
Authors

Christodoulos Constantinides (IBM)
Dhaval Patel (IBM TJ Watson Research Center)
Shuxin Lin (IBM Research)
Claudio Guerrero (IBM)
Sunil Dagajirao Patil (IBM)
Jayant Kalagnanam (IBM TJ Watson Research Center)

Computer Science · Data Science · Machine Learning · Artificial Intelligence