🤖 AI Summary
Effective multimodal understanding and assistance for procedural activities—such as cooking, furniture assembly, and laboratory operations—remains an open challenge due to the lack of generalizable frameworks. This paper introduces TAMA, a fine-tuning-free, tool-augmented multimodal agent framework that enables cross-modal interleaved reasoning via multimedia-returning tools and an agent-driven dynamic tool selection mechanism. Unlike prior approaches, TAMA requires no task-specific training and operates solely on off-the-shelf vision-language models (e.g., GPT-5, MiMo-VL) to jointly comprehend video, text, and action sequences. Its key contribution is the extension of the thinking-with-images paradigm to long-horizon procedural tasks, coupled with plug-and-play tool integration. On the ProMQA-Assembly benchmark, TAMA substantially outperforms baseline models. Ablation studies validate the critical roles of both the tool scheduling mechanism and the multimedia feedback module.
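The agent loop described above can be illustrated with a minimal sketch: the model repeatedly selects a tool, and tool results (which may be images or text) are appended back into the context so that reasoning interleaves modalities. This is an illustration only, not TAMA's actual implementation; the tool names (`extract_frame`, `retrieve_step`), the action format, and the dispatch logic are all hypothetical.

```python
# Minimal sketch of an agentic loop with multimedia-returning tools.
# Tool names and the action/result formats below are hypothetical
# illustrations, not the paper's actual interface.

def extract_frame(video, timestamp):
    """Hypothetical tool: return an image observation (placeholder dict)."""
    return {"type": "image", "source": video, "t": timestamp}

def retrieve_step(manual, step_id):
    """Hypothetical tool: return the text of an instruction step."""
    return {"type": "text", "content": manual[step_id]}

TOOLS = {"extract_frame": extract_frame, "retrieve_step": retrieve_step}

def agent_loop(model, context, max_turns=5):
    """Each turn, the model either calls a tool or answers.
    Multimedia tool results flow back into the context, enabling
    interleaved multimodal reasoning without any fine-tuning."""
    for _ in range(max_turns):
        action = model(context)  # e.g. {"tool": ..., "args": ...} or {"answer": ...}
        if "answer" in action:
            return action["answer"]
        result = TOOLS[action["tool"]](**action["args"])
        context.append(result)   # image or text re-enters the reasoning context
    return None
```

The key design point is that tools return media objects rather than only strings, so a stock vision-language model can inspect retrieved frames or steps mid-reasoning.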
📝 Abstract
Procedural activity assistants can potentially support humans in a variety of settings, from daily life, e.g., cooking or assembling flat-pack furniture, to professional contexts, e.g., manufacturing or biological experiments. Despite these potential use cases, system development tailored to such assistants remains underexplored. In this paper, we propose a novel framework, TAMA, a Tool-Augmented Multimodal Agent, for procedural activity understanding. TAMA enables interleaved multimodal reasoning by making use of multimedia-returning tools in a training-free setting. Our experimental results on the multimodal procedural QA dataset ProMQA-Assembly show that our approach improves the performance of vision-language models, especially GPT-5 and MiMo-VL. Furthermore, our ablation studies provide empirical support for the effectiveness of the two features that characterize our framework: multimedia-returning tools and agentic, flexible tool selection. We believe our proposed framework and experimental results advance the thinking-with-images paradigm for video and multimodal tasks, as well as the development of procedural activity assistants.