🤖 AI Summary
High-quality tool-use trajectory data is scarce, hindering large language models’ reliable invocation of external tools in complex tasks. Existing synthesis approaches predominantly verify correctness only at the dialogue level, which fails to suppress error propagation at the level of individual turns. This paper introduces ToolMind, a large-scale tool-agentic dataset (160k synthetic instances built from over 20k tools, plus 200k augmented open-source instances) generated by a pipeline that models tool dependencies via a function graph over parameter correlations and simulates realistic interactions through a tri-agent collaboration of user, assistant, and tool agents. A fine-grained turn-level filtering mechanism then removes erroneous or suboptimal steps while preserving self-corrective reasoning signals. Models fine-tuned on ToolMind show substantial gains over baselines in tool-call accuracy and multi-step reasoning across several benchmarks.
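The turn-level filtering idea can be illustrated with a minimal sketch. Assume each assistant turn carries a binary error flag from some verifier (the paper does not specify this exact interface); the hypothetical rule below keeps an erroneous turn only when a later turn recovers from it, so self-correction episodes survive while dead-end errors are dropped.

```python
# Minimal, illustrative sketch of turn-level filtering. The (content,
# is_error) representation and the keep-if-later-recovery rule are
# assumptions for illustration, not ToolMind's actual verifier.

def filter_turns(turns):
    """turns: list of (content, is_error) pairs in dialogue order.

    Keeps every correct turn, and keeps an erroneous turn only when some
    later turn succeeds -- i.e. the error is part of a self-correction
    episode rather than a dead end."""
    kept = []
    for i, (content, is_error) in enumerate(turns):
        if not is_error:
            kept.append(content)
        elif any(not later_error for _, later_error in turns[i + 1:]):
            # Error followed by recovery: retain the self-corrective signal.
            kept.append(content)
    return kept
```

For example, `filter_turns([("call_a", False), ("bad_call", True), ("retry", False)])` keeps all three turns because the error is later corrected, whereas a trajectory ending in an uncorrected error has that final step removed.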
📝 Abstract
Large Language Model (LLM) agents have developed rapidly in recent years to solve complex real-world problems using external tools. However, the scarcity of high-quality trajectories still hinders the development of stronger LLM agents. Most existing works on multi-turn dialogue synthesis validate correctness only at the trajectory level, which may overlook turn-level errors that can propagate during training and degrade model performance. To address these limitations, we introduce ToolMind, a large-scale, high-quality tool-agentic dataset with 160k synthetic data instances generated using over 20k tools and 200k augmented open-source data instances. Our data synthesis pipeline first constructs a function graph based on parameter correlations and then uses a multi-agent framework to simulate realistic user-assistant-tool interactions. Beyond trajectory-level validation, we employ fine-grained turn-level filtering to remove erroneous or suboptimal steps, ensuring that only high-quality reasoning traces are retained. This approach mitigates error amplification during training while preserving self-corrective reasoning signals essential for robust tool-use learning. Models fine-tuned on ToolMind show significant improvements over baselines on several benchmarks.
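As a rough sketch of the function-graph construction step, one simple parameter-correlation criterion is to link two tools whenever one tool's output field matches another tool's input parameter by name. The `tools` schema and the name-matching rule below are illustrative assumptions; the paper's actual correlation analysis is not specified here.

```python
from collections import defaultdict

# Hypothetical sketch: build a directed function graph where an edge
# producer -> consumer means the producer emits a parameter the consumer
# takes as input. Matching on parameter name alone is a simplification.

def build_function_graph(tools):
    """tools: list of dicts with 'name', 'inputs', 'outputs' (parameter names).
    Returns a sorted list of (producer, consumer) edges."""
    # Index tools by each output parameter they produce.
    producers = defaultdict(list)
    for tool in tools:
        for out in tool["outputs"]:
            producers[out].append(tool["name"])
    # Draw an edge when one tool's output feeds another tool's input.
    edges = set()
    for tool in tools:
        for inp in tool["inputs"]:
            for src in producers.get(inp, []):
                if src != tool["name"]:
                    edges.add((src, tool["name"]))
    return sorted(edges)

tools = [
    {"name": "search_flights", "inputs": ["city"], "outputs": ["flight_id"]},
    {"name": "book_flight", "inputs": ["flight_id"], "outputs": ["booking_id"]},
    {"name": "get_receipt", "inputs": ["booking_id"], "outputs": ["pdf_url"]},
]
```

On this toy registry the graph chains `search_flights -> book_flight -> get_receipt`, giving the kind of multi-step call dependencies from which realistic trajectories could be sampled.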