ProMQA-Assembly: Multimodal Procedural QA Dataset on Assembly

📅 2025-09-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing assembly-task benchmarks lack application-oriented, systematic evaluation frameworks for procedural-activity assistants. Method: We introduce ProMQA-Assembly, a new multimodal question-answering dataset tailored to assembly scenarios, comprising 391 question-answer pairs that require joint reasoning over assembly videos and instruction manuals. We propose a semi-automated annotation framework in which large language models generate candidate QA pairs and humans verify them, augmented with multimodal temporal alignment, instruction task graphs, and fine-grained action labels to improve question diversity and annotation efficiency. Contribution/Results: Rigorous quality control ensures high fidelity. Benchmarking reveals severe limitations of current multimodal large models on this task (average accuracy: 42.7%), exposing critical deficits in procedural reasoning. ProMQA-Assembly establishes a reproducible, extensible benchmark for evaluating and advancing procedural assistants.

📝 Abstract
Assistants on assembly tasks have a large potential to benefit humans from everyday tasks to industrial settings. However, no testbeds support application-oriented system evaluation in a practical setting, especially in assembly. To foster the development, we propose a new multimodal QA dataset on assembly activities. Our dataset, ProMQA-Assembly, consists of 391 QA pairs that require the multimodal understanding of human-activity recordings and their instruction manuals in an online-style manner. In the development, we adopt a semi-automated QA annotation approach, where LLMs generate candidates and humans verify them, as a cost-effective method, and further improve it by integrating fine-grained action labels to diversify question types. Furthermore, we create instruction task graphs for the target tasks of assembling toy vehicles. These newly created task graphs are used in our benchmarking experiment, as well as to facilitate the human verification process in the QA annotation. Utilizing our dataset, we benchmark models, including competitive proprietary multimodal models. Our results suggest great room for improvement for the current models. We believe our new evaluation dataset can contribute to the further development of procedural-activity assistants.
Problem

Research questions and friction points this paper is trying to address.

Lack of testbeds for assembly assistant evaluation
Need multimodal understanding of activity recordings and manuals
Requires procedural QA dataset for assembly tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal QA dataset for assembly activities
Semi-automated LLM-human QA annotation approach
Instruction task graphs for assembly benchmarking
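The instruction task graphs above encode precedence constraints among assembly steps, which lets a system check a recorded action sequence against the manual in an online manner. A minimal sketch of that idea (not the authors' code; the step names for the toy-vehicle example are hypothetical):

```python
from collections import defaultdict

class TaskGraph:
    """Directed acyclic graph of assembly steps with precedence edges."""

    def __init__(self):
        # step -> set of steps that must be completed first
        self.prereqs = defaultdict(set)

    def add_edge(self, before, after):
        """Record that `before` must be done before `after`."""
        self.prereqs[after].add(before)

    def valid_order(self, steps):
        """Return True if `steps` respects every precedence constraint."""
        done = set()
        for step in steps:
            if not self.prereqs[step] <= done:
                return False  # a prerequisite has not been performed yet
            done.add(step)
        return True

# Hypothetical toy-vehicle assembly, for illustration only
g = TaskGraph()
g.add_edge("attach_wheels", "mount_body")  # wheels before body
g.add_edge("mount_body", "fix_roof")       # body before roof

print(g.valid_order(["attach_wheels", "mount_body", "fix_roof"]))  # True
print(g.valid_order(["mount_body", "attach_wheels", "fix_roof"]))  # False
```

Such a graph supports both uses mentioned in the abstract: verifying annotated QA pairs against the manual and scoring whether a model's judgment of an observed step sequence is consistent with the instructions.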
Kimihiro Hasegawa
Language Technologies Institute, Carnegie Mellon University
Wiradee Imrattanatrai
National Institute of Advanced Industrial Science and Technology (AIST)
Masaki Asada
National Institute of Advanced Industrial Science and Technology (AIST)
Susan Holm
Language Technologies Institute, Carnegie Mellon University
Yuran Wang
Language Technologies Institute, Carnegie Mellon University
Vincent Zhou
Language Technologies Institute, Carnegie Mellon University
Ken Fukuda
National Institute of Advanced Industrial Science and Technology (AIST)
Teruko Mitamura
Research Professor of Language Technologies Institute, School of Computer Science, Carnegie Mellon
Natural Language Processing, Question Answering, Japanese NLP, Semantics, Events