FactoryBench: Evaluating Industrial Machine Understanding

📅 2026-05-08

🤖 AI Summary
This work addresses the lack of a systematic benchmark for evaluating how well time-series models and large language models (LLMs) understand industrial robot telemetry. It introduces FactoryBench, a benchmark organized along four causal levels (state, intervention, counterfactual, decision) that instantiate Pearl's ladder of causation. FactoryBench is built with a scalable, template-based question generation framework over multi-source, multivariate time-series sensor data. Answers span five formats: four structured formats are scored deterministically, and free-form answers are scored by an LLM-as-judge voting protocol. The benchmark comprises over 70,000 question-answer pairs grounded in roughly 15,000 normalized episodes from FactoryWave, AURSAD, and voraus-AD. In zero-shot evaluation of six frontier LLMs, no model exceeds 50% on the structured levels or 18% on decision-making, revealing a wide gap between current models and operational machine understanding.
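As a rough illustration of what template-based question generation over telemetry episodes could look like, the sketch below fills a question template once per recorded signal. The template wording, the `TEMPLATES` table, the `generate_questions` helper, and the episode layout are all hypothetical and are not taken from FactoryBench.

```python
from string import Template

# Hypothetical templates keyed by causal level. The wording is illustrative
# only and does not reproduce any FactoryBench template.
TEMPLATES = {
    "state": Template("What was the mean $signal during episode $episode_id?"),
    "intervention": Template(
        "If $signal were raised by $delta, which fault becomes more likely "
        "in episode $episode_id?"
    ),
}


def generate_questions(episode: dict, level: str) -> list[str]:
    """Fill the template for `level` once per signal recorded in the episode."""
    template = TEMPLATES[level]
    return [
        # safe_substitute ignores placeholders a given template does not use.
        template.safe_substitute(
            signal=name, episode_id=episode["id"], delta="10%"
        )
        for name in sorted(episode["signals"])
    ]


# Assumed episode layout: an id plus a mapping from signal name to samples.
episode = {
    "id": "ep-0001",
    "signals": {"joint_current": [0.8, 0.9], "tcp_force": [1.2, 1.1]},
}
questions = generate_questions(episode, "state")
```

Because each question is produced from a template and an episode, the number of items scales with the number of episodes and signals rather than with manual authoring effort.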
📝 Abstract
We introduce FactoryBench, a benchmark for evaluating time-series models and LLMs on machine understanding over industrial robotic telemetry. Q&A pairs are organized along four causal levels (state, intervention, counterfactual, decision) instantiating Pearl's ladder of causation, and span five answer formats: four structured formats are scored deterministically and free-form answers are scored by an LLM-as-judge voting protocol. We propose a scalable Q&A generation framework built around structured question templates, present FactoryWave (a dense, multitask, multivariate sensor dataset collected from a UR3 cobot and a KUKA KR10 industrial arm), and construct FactoryBench as a large-scale benchmark of over 70k Q&A items grounded in roughly 15k normalized episodes from FactoryWave, AURSAD, and voraus-AD. Zero-shot evaluation of six frontier LLMs shows that no model exceeds 50% on structured levels or 18% on decision-making, revealing a wide gap between current models and operational machine understanding.
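The split between deterministic scoring and judge voting described in the abstract can be sketched as a small routing function. This is a minimal sketch under stated assumptions: the four structured format names, the exact-match rule, and the strict-majority vote are placeholders chosen for illustration, since the abstract does not specify them.

```python
from collections import Counter

# Assumed names for the four structured formats; the abstract does not list them.
STRUCTURED_FORMATS = {"multiple_choice", "boolean", "numeric", "ranking"}


def score_structured(prediction: str, gold: str) -> float:
    """Deterministic scoring, assumed here to be a normalized exact match."""
    return float(prediction.strip().lower() == gold.strip().lower())


def score_free_form(judge_votes: list[bool]) -> float:
    """Strict majority vote over judge verdicts; ties count as incorrect."""
    counts = Counter(judge_votes)
    return float(counts[True] > counts[False])


def score_item(answer_format, prediction, gold, judge_votes=None) -> float:
    """Route an item to deterministic scoring or to the judge vote."""
    if answer_format in STRUCTURED_FORMATS:
        return score_structured(prediction, gold)
    if not judge_votes:
        raise ValueError("free-form items need at least one judge verdict")
    return score_free_form(judge_votes)


structured_score = score_item("boolean", " Yes", "yes")
free_form_score = score_item(
    "free_form", "answer text", "reference text", [True, True, False]
)
```

Here `judge_votes` stands in for verdicts already collected from the LLM judges; how those verdicts are elicited is outside the scope of the sketch.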
Problem

Research questions and friction points this paper is trying to address.

machine understanding
industrial robotics
causal reasoning
time-series models
LLM evaluation
Innovation

Methods, ideas, or system contributions that make the work stand out.

FactoryBench
causal reasoning
industrial machine understanding
LLM-as-judge
structured question templates