FactoryBench: Evaluating Industrial Machine Understanding

📅 2026-05-08

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the lack of a systematic benchmark for assessing how well large language models (LLMs) understand industrial robot telemetry. To bridge this gap, we introduce FactoryBench, an industrial machine understanding benchmark grounded in Pearl's ladder of causation, with questions organized along four levels: state, intervention, counterfactual, and decision. FactoryBench features a scalable, template-based question generation framework over multi-source, multivariate time-series sensor data. Structured answers are scored deterministically, and free-form answers are scored by an LLM-as-judge voting protocol. The benchmark comprises over 70,000 question-answer pairs. In zero-shot evaluations of six frontier LLMs, no model exceeds 50% on the structured levels or 18% on decision-making, revealing a substantial gap between current model capabilities and the demands of real-world industrial machine understanding.
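
To make the idea of template-based question generation concrete, here is a minimal sketch of how one structured template might be filled from a telemetry episode. The template wording, the `QAItem` fields, the signal name, and the sample values are all hypothetical and are not taken from FactoryBench itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QAItem:
    level: str      # one of: state, intervention, counterfactual, decision
    question: str
    answer: str


# Hypothetical template for a "state"-level question.
TEMPLATE = "In episode {episode_id}, what was the peak value of {signal}?"


def generate_item(episode_id: str, signal: str, values: list[float]) -> QAItem:
    """Fill the template and derive the reference answer from the raw signal."""
    question = TEMPLATE.format(episode_id=episode_id, signal=signal)
    answer = f"{max(values):.2f}"  # the answer is computed from data, not written by hand
    return QAItem(level="state", question=question, answer=answer)


# Illustrative episode; the identifier and readings are made up.
item = generate_item("ep-0001", "joint_3_current", [0.8, 1.25, 1.1])
print(item.question)
print(item.answer)
```

Because the reference answer is derived from the signal, the same template can in principle be applied to every episode, which is what makes this style of generation scale.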

📝 Abstract

We introduce FactoryBench, a benchmark for evaluating time-series models and LLMs on machine understanding over industrial robotic telemetry. Q&A pairs are organized along four causal levels (state, intervention, counterfactual, decision) instantiating Pearl's ladder of causation, and span five answer formats: four structured formats are scored deterministically and free-form answers are scored by an LLM-as-judge voting protocol. We propose a scalable Q&A generation framework built around structured question templates, present FactoryWave (a dense, multitask, multivariate sensor dataset collected from a UR3 cobot and a KUKA KR10 industrial arm), and construct FactoryBench as a large-scale benchmark of over 70k Q&A items grounded in roughly 15k normalized episodes from FactoryWave, AURSAD, and voraus-AD. Zero-shot evaluation of six frontier LLMs shows that no model exceeds 50% on structured levels or 18% on decision-making, revealing a wide gap between current models and operational machine understanding.
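
The two scoring paths described above can be sketched as follows. This is an illustration under stated assumptions, not the paper's protocol: the normalization rule in `score_structured`, the strict-majority rule in `judge_vote`, and both function names are hypothetical, and the judge verdicts are passed in as plain booleans rather than obtained from real model calls.

```python
from collections import Counter


def score_structured(predicted: str, reference: str) -> bool:
    """Deterministic scoring: exact match after trimming whitespace and lowercasing.

    Assumption: the real benchmark may use format-specific comparison rules.
    """
    return predicted.strip().lower() == reference.strip().lower()


def judge_vote(verdicts: list[bool]) -> bool:
    """Aggregate per-judge verdicts on a free-form answer by strict majority.

    Assumption: ties count as incorrect.
    """
    counts = Counter(verdicts)
    return counts[True] > counts[False]


print(score_structured(" B ", "b"))     # structured answer matches after normalization
print(judge_vote([True, True, False]))  # two of three judges accept the answer
```

Keeping the deterministic path separate from the judged path means only free-form answers depend on judge models, so scores on the four structured formats stay reproducible.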

Problem

Research questions and friction points this paper is trying to address.

machine understanding

industrial robotics

causal reasoning

time-series models

LLM evaluation

Innovation

Methods, ideas, or system contributions that make the work stand out.

FactoryBench

causal reasoning

industrial machine understanding