FieldWorkArena: Agentic AI Benchmark for Real Field Work Tasks

📅 2025-05-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing AI agent benchmarks are largely confined to web-based tasks and fail to adequately assess agents in real-world work environments such as factories and warehouses. This paper introduces FieldWorkArena, a benchmark designed for such field-work scenarios, covering multimodal tasks including safety monitoring and incident reporting. It is constructed from videos captured on-site, documents actually used in factories and warehouses, and interviews with frontline workers and managers. The authors define a new action space that agentic AI should possess in real-world work environments, improve on prior evaluation functions to assess multimodal LLMs (e.g., GPT-4o), and provide quantitative metrics spanning video, text, and document modalities. Experiments confirm that such evaluation is feasible and identify both the effectiveness and the limitations of the proposed method. The full dataset (Hugging Face) and evaluation code (GitHub) are publicly available.

📝 Abstract
This paper proposes FieldWorkArena, a benchmark for agentic AI targeting real-world field work. With the recent increase in demand for agentic AI, such agents are required to monitor and report safety and health incidents, as well as manufacturing-related incidents, that may occur in real-world work environments. Existing agentic AI benchmarks have been limited to evaluating web tasks and are insufficient for evaluating agents in real-world work environments, where complexity increases significantly. In this paper, we define a new action space that agentic AI should possess for real-world work environment benchmarks and improve the evaluation function from previous methods to assess the performance of agentic AI on diverse real-world tasks. The dataset consists of videos captured on-site and documents actually used in factories and warehouses, and tasks were created based on interviews with on-site workers and managers. Evaluation results confirmed that performance evaluation considering the characteristics of Multimodal LLMs (MLLMs) such as GPT-4o is feasible. Additionally, the effectiveness and limitations of the proposed new evaluation method were identified. The complete dataset (HuggingFace) and evaluation program (GitHub) can be downloaded from the following website: https://en-documents.research.global.fujitsu.com/fieldworkarena/.
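This listing contains no code, but to make the evaluation setup described in the abstract more concrete, here is a purely illustrative Python sketch of what a benchmark task record and a scoring function might look like. All field names and the naive keyword-overlap scorer are hypothetical stand-ins, not the actual schema or the MLLM-based evaluation function used by FieldWorkArena.

```python
from dataclasses import dataclass

@dataclass
class FieldTask:
    """Hypothetical task record: every field here is illustrative,
    not the schema published by FieldWorkArena."""
    task_id: str
    instruction: str        # textual instruction given to the agent
    video_path: str         # on-site video the agent must analyze
    documents: list         # operational documents (file paths)
    reference_answer: str   # expected report text

def keyword_overlap_score(prediction: str, reference: str) -> float:
    """Naive scorer: fraction of reference keywords present in the
    prediction. A toy stand-in for the paper's evaluation function."""
    ref_tokens = set(reference.lower().split())
    if not ref_tokens:
        return 0.0
    pred_tokens = set(prediction.lower().split())
    return len(ref_tokens & pred_tokens) / len(ref_tokens)

# Example usage with a made-up safety-monitoring task.
task = FieldTask(
    task_id="demo-001",
    instruction="Report any worker not wearing a helmet.",
    video_path="videos/line3.mp4",
    documents=["docs/safety_manual.pdf"],
    reference_answer="worker near conveyor missing helmet",
)
score = keyword_overlap_score(
    "A worker near the conveyor is missing a helmet",
    task.reference_answer,
)
```

In the actual benchmark, the scorer would instead query an MLLM-compatible evaluation function over the agent's multimodal output; the overlap metric above only illustrates the task-in, score-out interface.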
Problem

Research questions and friction points this paper is trying to address.

Proposes FieldWorkArena benchmark for agentic AI in real-world field work
Defines new action space for AI in real-world work environments
Improves the evaluation function to assess performance on diverse real-world tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Defines new action space for real-world AI agents
Improves the evaluation function for diverse real-world tasks
Builds the dataset from on-site videos and documents used in factories and warehouses
Atsunori Moteki
Fujitsu Limited, Japan
Shoichi Masui
Fujitsu Limited, Japan
Fan Yang
Fujitsu Research of America, USA
Yueqi Song
BS/MS student, Carnegie Mellon University
AI Agents · Multimodal NLP · Multilingual NLP
Yonatan Bisk
Assistant Professor, Carnegie Mellon University
Natural Language Processing · Embodied AI · Robot Learning
Graham Neubig
Carnegie Mellon University, All Hands AI
Natural Language Processing · Machine Learning · Artificial Intelligence
Ikuo Kusajima
Fujitsu Limited, Japan
Yasuto Watanabe
Fujitsu Limited, Japan
Hiroyuki Ishida
Fujitsu Limited, Japan
Jun Takahashi
Fujitsu Limited, Japan
Shan Jiang
Fujitsu Limited, Japan