Act Wisely: Cultivating Meta-Cognitive Tool Use in Agentic Multimodal Models

📅 2026-04-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge faced by multimodal agents in balancing internal knowledge and external tool usage during decision-making, which often leads to redundant tool calls, reasoning delays, and susceptibility to noise. To mitigate these issues, the authors propose the HDPO framework, which innovatively reformulates tool efficiency from a competitive scalar objective into a conditional one, thereby decoupling accuracy optimization from tool invocation. HDPO employs a dual-channel reinforcement learning mechanism that constrains tool usage exclusively along correct reasoning trajectories, enabling curriculum-style cognitive learning. The resulting model, Metis, significantly reduces the number of tool calls while simultaneously improving reasoning accuracy.
📝 Abstract
The advent of agentic multimodal models has empowered systems to actively interact with external environments. However, current agents suffer from a profound meta-cognitive deficit: they struggle to arbitrate between leveraging internal knowledge and querying external utilities. Consequently, they frequently fall prey to blind tool invocation, resorting to reflexive tool execution even when queries are resolvable from the raw visual context. This pathological behavior precipitates severe latency bottlenecks and injects extraneous noise that derails sound reasoning. Existing reinforcement learning protocols attempt to mitigate this via a scalarized reward that penalizes tool usage. Yet, this coupled formulation creates an irreconcilable optimization dilemma: an aggressive penalty suppresses essential tool use, whereas a mild penalty is entirely subsumed by the variance of the accuracy reward during advantage normalization, rendering it impotent against tool overuse. To transcend this bottleneck, we propose HDPO, a framework that reframes tool efficiency from a competing scalar objective to a strictly conditional one. By eschewing reward scalarization, HDPO maintains two orthogonal optimization channels: an accuracy channel that maximizes task correctness, and an efficiency channel that enforces execution economy exclusively within accurate trajectories via conditional advantage estimation. This decoupled architecture naturally induces a cognitive curriculum, compelling the agent to first master task resolution before refining its self-reliance. Extensive evaluations demonstrate that our resulting model, Metis, reduces tool invocations by orders of magnitude while simultaneously elevating reasoning accuracy.
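The conditional advantage estimation described above can be sketched as follows. This is a hypothetical illustration based only on the abstract: the function name, the group-normalization scheme (GRPO-style), and the `lam` weight are assumptions, not details from the paper. The key idea it demonstrates is that the efficiency penalty is normalized and applied only within correct trajectories, so it cannot trade off against accuracy during advantage normalization.

```python
# Sketch of a dual-channel (accuracy + conditional efficiency) advantage
# computation in the spirit of HDPO. All names and hyperparameters here
# are illustrative assumptions, not the authors' implementation.
from statistics import mean, pstdev

def hdpo_advantages(correct, tool_calls, lam=0.1, eps=1e-8):
    """Per-rollout advantages for one group of sampled trajectories.

    correct:    list of 0/1 accuracy rewards, one per rollout
    tool_calls: list of tool-invocation counts, one per rollout
    lam:        assumed weight of the efficiency channel
    """
    # Accuracy channel: group-normalized advantage over ALL rollouts.
    mu_a, sd_a = mean(correct), pstdev(correct)
    acc_adv = [(r - mu_a) / (sd_a + eps) for r in correct]

    # Efficiency channel: negative tool-call count, normalized ONLY over
    # the correct trajectories. Incorrect rollouts receive zero efficiency
    # signal, so economizing on tools never competes with getting the
    # answer right -- the decoupling the abstract describes.
    correct_costs = [-t for r, t in zip(correct, tool_calls) if r == 1]
    if len(correct_costs) >= 2:
        mu_e, sd_e = mean(correct_costs), pstdev(correct_costs)
        eff_adv = [
            (-t - mu_e) / (sd_e + eps) if r == 1 else 0.0
            for r, t in zip(correct, tool_calls)
        ]
    else:
        # Too few correct rollouts to normalize: efficiency channel idles,
        # which naturally yields the curriculum (accuracy first).
        eff_adv = [0.0] * len(correct)

    return [a + lam * e for a, e in zip(acc_adv, eff_adv)]
```

Note how the fallback branch induces the curriculum-style behavior: early in training, when few trajectories are correct, only the accuracy channel carries gradient; once the group reliably solves the task, the efficiency channel begins ranking correct trajectories by tool economy.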
Problem

Research questions and friction points this paper is trying to address.

meta-cognitive deficit
tool overuse
agentic multimodal models
reasoning accuracy
tool invocation
Innovation

Methods, ideas, or system contributions that make the work stand out.

meta-cognitive tool use
agentic multimodal models
conditional advantage estimation
decoupled optimization
tool efficiency