MindWatcher: Toward Smarter Multimodal Tool-Integrated Reasoning

📅 2025-12-29
🤖 AI Summary
Problem: Conventional workflow-based agents lack autonomy and generalization in tool invocation and cross-environment decision-making. Method: The paper proposes MindWatcher, a tool-integrated reasoning (TIR) agent that combines interleaved thinking with multimodal chain-of-thought reasoning, so it can decide whether and how to invoke tools without human prompts or predefined workflows. The approach incorporates automated data auditing and evaluation pipelines, a local image retrieval database covering eight categories, and a more efficient training infrastructure; the authors also report a “genetic inheritance” phenomenon in agentic reinforcement learning and introduce MWE-Bench to evaluate the agent. Results: Experiments show that MindWatcher matches or exceeds larger or more recent models through superior tool invocation, suggesting that a compact model coordinated with well-chosen tools can compete with models that rely on scale alone.

📝 Abstract
Traditional workflow-based agents exhibit limited intelligence when addressing real-world problems requiring tool invocation. Tool-integrated reasoning (TIR) agents capable of autonomous reasoning and tool invocation are rapidly emerging as a powerful approach for complex decision-making tasks involving multi-step interactions with external environments. In this work, we introduce MindWatcher, a TIR agent integrating interleaved thinking and multimodal chain-of-thought (CoT) reasoning. MindWatcher can autonomously decide whether and how to invoke diverse tools and coordinate their use, without relying on human prompts or workflows. The interleaved thinking paradigm enables the model to switch between thinking and tool calling at any intermediate stage, while its multimodal CoT capability allows it to manipulate images during reasoning to yield more precise search results. We implement automated data auditing and evaluation pipelines, complemented by manually curated high-quality datasets for training, and we construct a benchmark, called MindWatcher-Evaluate Bench (MWE-Bench), to evaluate its performance. MindWatcher is equipped with a comprehensive suite of auxiliary reasoning tools, enabling it to address broad-domain multimodal problems. A large-scale, high-quality local image retrieval database, covering eight categories including cars, animals, and plants, endows the model with robust object recognition despite its small size. Finally, we design a more efficient training infrastructure for MindWatcher, enhancing training speed and hardware utilization. Experiments not only demonstrate that MindWatcher matches or exceeds the performance of larger or more recent models through superior tool invocation, but also uncover critical insights for agent training, such as the genetic inheritance phenomenon in agentic RL.
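The interleaved thinking paradigm described in the abstract can be illustrated with a toy control loop. This is a minimal sketch under stated assumptions, not MindWatcher's implementation: the names `Step`, `run_agent`, `scripted_policy`, and the `image_search` tool are hypothetical, and the policy is a hard-coded stand-in for a trained model. It only shows the control flow in which the agent may think, call a tool, read the observation, and think again before answering.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class Step:
    kind: str                   # "think", "tool", "observation", or "answer"
    content: str
    tool: Optional[str] = None  # tool name, set only when kind == "tool"


def run_agent(
    policy: Callable[[str, List[Step]], Step],
    tools: Dict[str, Callable[[str], str]],
    question: str,
    max_steps: int = 8,
) -> Tuple[Optional[str], List[Step]]:
    """Let the policy choose, at every step, between thinking, calling a tool, or answering."""
    trace: List[Step] = []
    for _ in range(max_steps):
        step = policy(question, trace)
        trace.append(step)
        if step.kind == "answer":
            return step.content, trace
        if step.kind == "tool":
            # Run the requested tool and feed its result back into the trace.
            observation = tools[step.tool](step.content)
            trace.append(Step("observation", observation))
    return None, trace  # step budget exhausted without an answer


def scripted_policy(question: str, trace: List[Step]) -> Step:
    """Hypothetical stand-in for the model: think, search, think again, answer."""
    observations = [s.content for s in trace if s.kind == "observation"]
    last = trace[-1].kind if trace else None
    if last in (None, "observation"):
        return Step("think", "Decide what to do next.")
    if not observations:
        # Assumed tool argument: a cropped image region, standing in for image manipulation.
        return Step("tool", "cropped-region", tool="image_search")
    return Step("answer", f"It is a {observations[-1]}.")


# Hypothetical tool registry; a real system would query a retrieval database here.
TOOLS = {"image_search": lambda query: "sedan"}

answer, trace = run_agent(scripted_policy, TOOLS, "What is in the image?")
print(answer)
print([s.kind for s in trace])
```

The point of the sketch is that no fixed workflow dictates the order of steps: the policy alone decides when to interleave a tool call with further thinking.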
Problem

Research questions and friction points this paper is trying to address.

How can an agent reason and invoke tools autonomously on complex multi-step tasks, without human prompts or fixed workflows?
How can interleaved thinking and image manipulation improve multimodal reasoning?
How can broad-domain multimodal agents be trained and evaluated efficiently?
Innovation

Methods, ideas, or system contributions that make the work stand out.

Integrates interleaved thinking and multimodal chain-of-thought reasoning
Uses automated data auditing and high-quality datasets for training
Implements a comprehensive suite of auxiliary reasoning tools
Authors
Jiawei Chen
Li Auto Inc
Xintian Shen
Li Auto Inc
Lihao Zheng
Li Auto Inc
Zhenwei Shao
Li Auto Inc
Hongyuan Zhang
Li Auto Inc
Pengfei Yu
University of Illinois at Urbana-Champaign
Natural Language Processing, Machine Learning
Xudong Rao
Li Auto Inc
Ning Mao
Li Auto Inc
Xiaobo Liu
Li Auto Inc
Lian Wen
Lecturer of ICT, Griffith University
Software Engineering, Artificial Intelligence
Chaoqun Du
Department of Automation, Tsinghua University
Feng Gu
Li Auto Inc
Wei He
Li Auto Inc
Qizhen Li
Li Auto Inc
Shanshan Li
Li Auto Inc
Zide Liu
Zhejiang University
Diffusion Models, Video Editing
Jing Luo
Shandong University
Natural Language Processing
Lifu Mu
Li Auto Inc
Xuhao Pan
Li Auto Inc
Chang Ren
Li Auto Inc
Haoyi Sun
Li Auto Inc
Qian Wang
Li Auto Inc
Wei Wang
Li Auto Inc
Hongfu Yang
Li Auto Inc
Jiqing Zhan
Li Auto Inc