🤖 AI Summary
High-quality tool-use trajectory data is scarce, hindering large language models’ reliable invocation of external tools in complex tasks. Existing synthesis approaches predominantly verify correctness only at the dialogue level, which fails to suppress error propagation at the level of individual turns. This paper introduces ToolMind, a large-scale tool-agentic dataset (160k synthetic instances built from over 20k tools, plus 200k augmented open-source instances) generated by a pipeline that models tool dependencies via a function graph over parameter correlations and simulates realistic interactions through a tri-agent collaboration of user, assistant, and tool agents. A fine-grained turn-level filtering mechanism then removes erroneous or suboptimal steps while preserving self-corrective reasoning signals. Models fine-tuned on ToolMind show substantial gains over baselines in tool-call accuracy and multi-step reasoning across several benchmarks.
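The turn-level filtering idea can be illustrated with a minimal sketch. Assume each assistant turn carries a binary error flag from some verifier (the paper does not specify this exact interface); the hypothetical rule below keeps an erroneous turn only when a later turn recovers from it, so self-correction episodes survive while dead-end errors are dropped.

```python
# Minimal, illustrative sketch of turn-level filtering. The (content,
# is_error) representation and the keep-if-later-recovery rule are
# assumptions for illustration, not ToolMind's actual verifier.

def filter_turns(turns):
    """turns: list of (content, is_error) pairs in dialogue order.

    Keeps every correct turn, and keeps an erroneous turn only when some
    later turn succeeds -- i.e. the error is part of a self-correction
    episode rather than a dead end."""
    kept = []
    for i, (content, is_error) in enumerate(turns):
        if not is_error:
            kept.append(content)
        elif any(not later_error for _, later_error in turns[i + 1:]):
            # Error followed by recovery: retain the self-corrective signal.
            kept.append(content)
    return kept
```

For example, `filter_turns([("call_a", False), ("bad_call", True), ("retry", False)])` keeps all three turns because the error is later corrected, whereas a trajectory ending in an uncorrected error has that final step removed.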
📝 Abstract
Large Language Model (LLM) agents have developed rapidly in recent years to solve complex real-world problems using external tools. However, the scarcity of high-quality trajectories still hinders the development of stronger LLM agents. Most existing works on multi-turn dialogue synthesis validate correctness only at the trajectory level, which may overlook turn-level errors that can propagate during training and degrade model performance. To address these limitations, we introduce ToolMind, a large-scale, high-quality tool-agentic dataset with 160k synthetic data instances generated using over 20k tools and 200k augmented open-source data instances. Our data synthesis pipeline first constructs a function graph based on parameter correlations and then uses a multi-agent framework to simulate realistic user-assistant-tool interactions. Beyond trajectory-level validation, we employ fine-grained turn-level filtering to remove erroneous or suboptimal steps, ensuring that only high-quality reasoning traces are retained. This approach mitigates error amplification during training while preserving self-corrective reasoning signals essential for robust tool-use learning. Models fine-tuned on ToolMind show significant improvements over baselines on several benchmarks.
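As a rough sketch of the function-graph construction step, one simple parameter-correlation criterion is to link two tools whenever one tool's output field matches another tool's input parameter by name. The `tools` schema and the name-matching rule below are illustrative assumptions; the paper's actual correlation analysis is not specified here.

```python
from collections import defaultdict

# Hypothetical sketch: build a directed function graph where an edge
# producer -> consumer means the producer emits a parameter the consumer
# takes as input. Matching on parameter name alone is a simplification.

def build_function_graph(tools):
    """tools: list of dicts with 'name', 'inputs', 'outputs' (parameter names).
    Returns a sorted list of (producer, consumer) edges."""
    # Index tools by each output parameter they produce.
    producers = defaultdict(list)
    for tool in tools:
        for out in tool["outputs"]:
            producers[out].append(tool["name"])
    # Draw an edge when one tool's output feeds another tool's input.
    edges = set()
    for tool in tools:
        for inp in tool["inputs"]:
            for src in producers.get(inp, []):
                if src != tool["name"]:
                    edges.add((src, tool["name"]))
    return sorted(edges)

tools = [
    {"name": "search_flights", "inputs": ["city"], "outputs": ["flight_id"]},
    {"name": "book_flight", "inputs": ["flight_id"], "outputs": ["booking_id"]},
    {"name": "get_receipt", "inputs": ["booking_id"], "outputs": ["pdf_url"]},
]
```

On this toy registry the graph chains `search_flights -> book_flight -> get_receipt`, giving the kind of multi-step call dependencies from which realistic trajectories could be sampled.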