TAMA: Tool-Augmented Multimodal Agent for Procedural Activity Understanding

📅 2025-09-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
Effective multimodal understanding and assistance for procedural activities, such as cooking, furniture assembly, and laboratory operations, remains an open challenge due to the lack of generalizable frameworks. This paper introduces TAMA, a fine-tuning-free, tool-augmented multimodal agent framework that enables cross-modal interleaved reasoning via multimedia-returning tools and an agent-driven dynamic tool selection mechanism. Unlike prior approaches, TAMA requires no task-specific training and operates solely on off-the-shelf vision-language models (e.g., GPT-5, MiMo-VL) to jointly comprehend video, text, and action sequences. Its key contribution is extending the thinking-with-images paradigm to long-horizon procedural tasks, coupled with plug-and-play tool integration. On the ProMQA-Assembly benchmark, TAMA improves the performance of its underlying vision-language models. Ablation studies support the roles of both the agentic tool-selection mechanism and the multimedia-returning tools.
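
To make the mechanism above concrete, here is a minimal sketch of such an agent loop. The chat-client interface (`client.chat`, `reply.tool_call`) and the `extract_frames` tool are assumptions for illustration, not TAMA's actual API; the only confirmed ingredients are a tool-calling vision-language model and tools whose returns are images rather than text.

```python
# Minimal sketch of a tool-augmented multimodal agent loop. The chat client
# interface and tool names are assumptions for illustration, not TAMA's API.
import base64
import json

import cv2  # pip install opencv-python


def extract_frames(video_path: str, timestamps: list[float]) -> list[bytes]:
    """Hypothetical multimedia-returning tool: JPEG frames at given times."""
    cap = cv2.VideoCapture(video_path)
    frames = []
    for t in timestamps:
        cap.set(cv2.CAP_PROP_POS_MSEC, t * 1000.0)
        ok, frame = cap.read()
        if ok:
            ok, buf = cv2.imencode(".jpg", frame)
            if ok:
                frames.append(buf.tobytes())
    cap.release()
    return frames


TOOLS = {"extract_frames": extract_frames}


def run_agent(client, question: str, video_path: str, max_steps: int = 8):
    """Interleaved reasoning: image-returning tool outputs are appended to
    the conversation as image blocks, so later reasoning steps can inspect
    the frames directly instead of a text description of them."""
    messages = [{"role": "user", "content": question}]
    for _ in range(max_steps):
        reply = client.chat(messages=messages, tools=list(TOOLS))
        if reply.tool_call is None:  # model chose to answer, not call a tool
            return reply.text
        args = json.loads(reply.tool_call.arguments)
        frames = TOOLS[reply.tool_call.name](video_path, **args)
        messages.append({
            "role": "tool",
            "content": [
                {"type": "image", "data": base64.b64encode(f).decode()}
                for f in frames
            ],
        })
    return None  # step budget exhausted without a final answer
```

The design point mirrored here is that tool returns are multimedia: at each step the model can re-inspect actual pixels rather than reasoning over a one-shot caption of the video.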

📝 Abstract
Procedural activity assistants could support humans in a variety of settings, from daily life, e.g., cooking or assembling flat-pack furniture, to professional situations, e.g., manufacturing or biological experiments. Despite these potential use cases, the development of systems tailored for such assistants remains underexplored. In this paper, we propose a novel framework, TAMA (Tool-Augmented Multimodal Agent), for procedural activity understanding. TAMA enables interleaved multimodal reasoning by making use of multimedia-returning tools in a training-free setting. Our experimental results on the multimodal procedural QA dataset ProMQA-Assembly show that our approach can improve the performance of vision-language models, especially GPT-5 and MiMo-VL. Furthermore, our ablation studies provide empirical support for the effectiveness of the two features that characterize our framework: multimedia-returning tools and agentic, flexible tool selection. We believe our proposed framework and experimental results facilitate the thinking-with-images paradigm for video and multimodal tasks, as well as the development of procedural activity assistants.
Problem

Research questions and friction points this paper is trying to address.

Developing multimodal agents for procedural activity understanding tasks
Enhancing vision-language models through tool-augmented reasoning
Improving performance on multimodal procedural QA benchmarks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Tool-Augmented Multimodal Agent for procedural understanding
Training-free interleaved multimodal reasoning with tools
Multimedia-returning tools and agentic, flexible tool selection (sketched below)
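As referenced in the list above, a small registry sketch can illustrate how agentic, plug-and-play tool selection fits together. The schema layout and field names below are assumptions continuing the hypothetical interface from the earlier sketch, not taken from the paper.

```python
# Sketch of plug-and-play tool registration for agent-driven selection.
# Schema layout and field names are assumptions, not TAMA's actual API.
from dataclasses import dataclass
from typing import Callable


@dataclass
class Tool:
    name: str
    description: str  # shown to the VLM so it can decide when to call
    schema: dict      # JSON schema of the tool's arguments
    fn: Callable


REGISTRY: dict[str, Tool] = {}


def register(tool: Tool) -> None:
    """New tools drop in at run time; no fine-tuning is involved."""
    REGISTRY[tool.name] = tool


register(Tool(
    name="extract_frames",
    description="Return video frames at the given timestamps as images.",
    schema={
        "type": "object",
        "properties": {
            "timestamps": {"type": "array", "items": {"type": "number"}},
        },
        "required": ["timestamps"],
    },
    fn=lambda video_path, timestamps: [],  # stub; see the loop sketch above
))

# At each step, the agent hands the model every registered tool's name,
# description, and schema, so which tool runs next is the model's choice
# at inference time rather than a fixed pipeline stage.
```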
👥 Authors
Kimihiro Hasegawa
Language Technologies Institute, Carnegie Mellon University
Wiradee Imrattanatrai
National Institute of Advanced Industrial Science and Technology (AIST)
Masaki Asada
National Institute of Advanced Industrial Science and Technology (AIST)
Ken Fukuda
National Institute of Advanced Industrial Science and Technology (AIST)
Teruko Mitamura
Research Professor, Language Technologies Institute, School of Computer Science, Carnegie Mellon University
Natural Language Processing · Question Answering · Japanese NLP · Semantics · Events