Advancing Tool-Augmented Large Language Models via Meta-Verification and Reflection Learning

📅 2025-06-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address two critical bottlenecks in LLM-based tool utilization, *unreliable planning* (caused by low-quality instruction data, e.g., hallucinated API calls) and *weak reflection capability* (static imitation learning leaves over 90% of errors uncorrected), this paper proposes Tool-MVR. It introduces a Multi-Agent Meta-Verification (MAMV) pipeline to construct the high-quality instruction dataset ToolBench-V, and an Exploration-based Reflection Learning (EXPLORE) paradigm to build ToolBench-R, the first dedicated tool-reflection dataset. By combining multi-agent verification, a dynamic "Error -> Reflection -> Correction" loop, and instruction fine-tuning of open-source models (e.g., Qwen-7B), Tool-MVR equips models with comprehensive System 2 reasoning. On StableToolBench, it outperforms ToolLLM by 23.9% and GPT-4 by 15.3% while reducing API calls by 31.4%. On RefineToolBench, its error correction rate reaches 58.9%, substantially surpassing ToolLLM's 9.1%.

📝 Abstract
Empowering large language models (LLMs) with effective tool utilization capabilities is crucial for enabling AI agents to solve complex problems. However, current models face two major limitations: (1) unreliable tool planning and invocation due to low-quality instruction datasets (e.g., widespread hallucinated API calls), and (2) weak tool reflection abilities (over 90% of errors cannot be corrected) resulting from static imitation learning. To address these critical limitations, we propose Tool-MVR, a novel Tool-Augmented LLM that achieves comprehensive System 2 reasoning through two key innovations. Specifically, we first introduce Multi-Agent Meta-Verification (MAMV), a systematic pipeline that rigorously validates APIs, queries, and reasoning trajectories to construct ToolBench-V, a new high-quality instruction dataset that addresses the limitation of unreliable tool planning and invocation. Second, we propose Exploration-based Reflection Learning (EXPLORE), which enhances tool reflection capabilities by leveraging tool feedback through a dynamic "Error -> Reflection -> Correction" learning paradigm, resulting in our reflection dataset ToolBench-R and addressing the critical weakness in tool reflection. Finally, we obtain Tool-MVR by fine-tuning open-source LLMs (e.g., Qwen-7B) on both ToolBench-V and ToolBench-R. Our experiments demonstrate that Tool-MVR achieves state-of-the-art performance on StableToolBench, surpassing both ToolLLM (by 23.9%) and GPT-4 (by 15.3%) while reducing API calls by 31.4%, with strong generalization capabilities across unseen tools and scenarios. Additionally, on our proposed RefineToolBench, the first benchmark specifically designed to evaluate tool reflection capabilities, Tool-MVR achieves a 58.9% error correction rate, significantly outperforming ToolLLM's 9.1%.
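The "Error -> Reflection -> Correction" paradigm from the abstract can be sketched as a simple loop: attempt a tool call, and on failure, reflect on the tool's feedback before proposing a corrected call. This is a minimal illustration only; all names here (`Step`, `reflection_loop`, the `propose`/`execute`/`reflect` callables) are assumptions, not the paper's actual interfaces or prompts.

```python
from dataclasses import dataclass

@dataclass
class Step:
    call: str            # the attempted API call
    feedback: str        # tool/environment feedback for that call
    reflection: str = "" # model's diagnosis of what went wrong
    corrected: str = ""  # the corrected call proposed after reflection

def reflection_loop(query, propose, execute, reflect, max_rounds=3):
    """Collect (error, reflection, correction) steps, ToolBench-R style.

    propose(query, trace) -> next call to try
    execute(call)         -> (success: bool, feedback: str)
    reflect(query, call, feedback) -> reflection text
    """
    trace = []
    call = propose(query, trace)
    for _ in range(max_rounds):
        ok, feedback = execute(call)
        step = Step(call=call, feedback=feedback)
        if ok:
            trace.append(step)
            return trace, True
        # On failure: reflect on tool feedback, then propose a correction.
        step.reflection = reflect(query, call, feedback)
        call = propose(query, trace + [step])
        step.corrected = call
        trace.append(step)
    return trace, False
```

The resulting `trace` pairs each failed call with its reflection and correction, which is the kind of supervision signal a reflection dataset like ToolBench-R would record.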
Problem

Research questions and friction points this paper is trying to address.

Unreliable tool planning and invocation in LLMs
Weak tool reflection abilities in current models
Need for high-quality datasets and dynamic learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-Agent Meta-Verification for reliable tool planning
Exploration-based Reflection Learning for error correction
Fine-tuning LLMs on high-quality datasets ToolBench-V and ToolBench-R
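The multi-agent verification idea above can be illustrated as a voting filter over candidate samples: a sample (API, query, or trajectory) is kept only if enough verifier agents approve it. This is a hedged sketch under the assumption that each verifier returns a pass/fail verdict; the paper's actual agents, prompts, and acceptance criteria are not specified here.

```python
def meta_verify(sample, verifiers, quorum=None):
    """Approve a sample if at least `quorum` verifiers pass it.

    Defaults to unanimity, the strictest (and an assumed) policy.
    """
    quorum = len(verifiers) if quorum is None else quorum
    votes = sum(1 for verify in verifiers if verify(sample))
    return votes >= quorum

def build_verified_dataset(raw_samples, verifiers):
    # ToolBench-V-style construction: retain only samples that survive
    # every verification stage (API validity, query quality, trajectory).
    return [s for s in raw_samples if meta_verify(s, verifiers)]
```

A strict quorum trades dataset size for reliability, which matches the paper's framing that low-quality instruction data (e.g., hallucinated API calls) is what degrades tool planning.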