🤖 AI Summary
Effective multimodal understanding and assistance for procedural activities—such as cooking, furniture assembly, and laboratory operations—remains an open challenge due to the lack of generalizable frameworks. This paper introduces TAMA, a fine-tuning-free, tool-augmented multimodal agent framework that enables cross-modal interleaved reasoning via multimedia-returning tools and an agent-driven dynamic tool selection mechanism. Unlike prior approaches, TAMA requires no task-specific training and operates solely on off-the-shelf vision-language models (e.g., GPT-5, MiMo-VL) to jointly comprehend video, text, and action sequences. Its key contribution is the extension of the thinking-with-images paradigm to long-horizon procedural tasks, coupled with plug-and-play tool integration. On the ProMQA-Assembly benchmark, TAMA substantially outperforms baseline models. Ablation studies validate the critical roles of both the tool scheduling mechanism and the multimedia feedback module.
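The agent loop described above can be illustrated with a minimal sketch: the model repeatedly selects a tool, and tool results (which may be images or text) are appended back into the context so that reasoning interleaves modalities. This is an illustration only, not TAMA's actual implementation; the tool names (`extract_frame`, `retrieve_step`), the action format, and the dispatch logic are all hypothetical.

```python
# Minimal sketch of an agentic loop with multimedia-returning tools.
# Tool names and the action/result formats below are hypothetical
# illustrations, not the paper's actual interface.

def extract_frame(video, timestamp):
    """Hypothetical tool: return an image observation (placeholder dict)."""
    return {"type": "image", "source": video, "t": timestamp}

def retrieve_step(manual, step_id):
    """Hypothetical tool: return the text of an instruction step."""
    return {"type": "text", "content": manual[step_id]}

TOOLS = {"extract_frame": extract_frame, "retrieve_step": retrieve_step}

def agent_loop(model, context, max_turns=5):
    """Each turn, the model either calls a tool or answers.
    Multimedia tool results flow back into the context, enabling
    interleaved multimodal reasoning without any fine-tuning."""
    for _ in range(max_turns):
        action = model(context)  # e.g. {"tool": ..., "args": ...} or {"answer": ...}
        if "answer" in action:
            return action["answer"]
        result = TOOLS[action["tool"]](**action["args"])
        context.append(result)   # image or text re-enters the reasoning context
    return None
```

The key design point is that tools return media objects rather than only strings, so a stock vision-language model can inspect retrieved frames or steps mid-reasoning.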
📝 Abstract
Procedural activity assistants can potentially support humans in a variety of settings, from daily life, e.g., cooking or assembling flat-pack furniture, to professional contexts, e.g., manufacturing or biological experiments. Despite these potential use cases, system development tailored to such assistants remains underexplored. In this paper, we propose a novel framework, TAMA, a Tool-Augmented Multimodal Agent, for procedural activity understanding. TAMA enables interleaved multimodal reasoning by making use of multimedia-returning tools in a training-free setting. Our experimental results on the multimodal procedural QA dataset ProMQA-Assembly show that our approach improves the performance of vision-language models, especially GPT-5 and MiMo-VL. Furthermore, our ablation studies provide empirical support for the effectiveness of the two features that characterize our framework: multimedia-returning tools and agentic, flexible tool selection. We believe our proposed framework and experimental results advance the thinking-with-images paradigm for video and multimodal tasks, as well as the development of procedural activity assistants.