TOUCAN: Synthesizing 1.5M Tool-Agentic Data from Real-World MCP Environments

📅 2025-10-01
📈 Citations: 0
✨ Influential: 0

🤖 AI Summary
Problem: Open-source communities lack high-quality, permissively licensed training data for tool-augmented agents, and existing datasets underrepresent realistic, diverse multi-tool and multi-turn interactions. Method: We construct Toucan, the largest publicly available tool-agentic dataset to date (1.5 million trajectories), synthesized from nearly 500 real-world Model Context Protocol (MCP) environments. The pipeline generates queries with five models, filters them with model-based quality checks, produces trajectories with three teacher models across two agentic frameworks, and validates the results with rule-based and model-based checks. Three extension mechanisms further diversify tasks and simulate multi-turn conversations. Contribution/Results: Models fine-tuned on Toucan outperform larger closed-source counterparts on BFCL V3 and advance the Pareto frontier on MCP-Universe Bench, supporting open, reproducible research in agentic AI.

📝 Abstract
Large Language Model (LLM) agents are rapidly emerging as powerful systems for automating tasks across domains. Yet progress in the open-source community is constrained by the lack of high-quality, permissively licensed tool-agentic training data. Existing datasets are often limited in diversity, realism, and complexity, particularly regarding multi-tool and multi-turn interactions. To address this gap, we introduce Toucan, the largest publicly available tool-agentic dataset to date, containing 1.5 million trajectories synthesized from nearly 500 real-world Model Context Protocols (MCPs). Unlike prior work, Toucan leverages authentic MCP environments to generate diverse, realistic, and challenging tasks with trajectories involving real tool execution. Our pipeline first produces a broad spectrum of tool-use queries using five distinct models, applies model-based quality filtering, and then generates agentic trajectories with three teacher models using two agentic frameworks. Rigorous rule-based and model-based validation ensures high-quality outputs. We also introduce three extension mechanisms to further diversify tasks and simulate multi-turn conversations. Models fine-tuned on Toucan outperform larger closed-source counterparts on the BFCL V3 benchmark and push the Pareto frontier forward on MCP-Universe Bench.
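
The stages named in the abstract (query generation, query filtering, trajectory generation, validation) can be read as a simple staged pipeline. The following is a minimal sketch of that shape only, not the authors' implementation: every name here (`synthesize`, `Trajectory`, the callable types) is hypothetical, the models are stand-in callables, and pairing every teacher with every framework for every query is an assumption made for illustration.

```python
from dataclasses import dataclass, field
from itertools import product
from typing import Callable, Iterable

# Hypothetical placeholder types; none of these come from the Toucan codebase.
QueryGenerator = Callable[[str], str]        # MCP server name -> task query
QueryJudge = Callable[[str], bool]           # stand-in for model-based query filtering
Teacher = Callable[[str, str], list[dict]]   # (query, framework) -> chat messages


@dataclass
class Trajectory:
    query: str
    teacher: str
    framework: str
    messages: list[dict] = field(default_factory=list)


def synthesize(
    servers: Iterable[str],
    generators: dict[str, QueryGenerator],
    keep_query: QueryJudge,
    teachers: dict[str, Teacher],
    frameworks: list[str],
    keep_trajectory: Callable[[Trajectory], bool],
) -> list[Trajectory]:
    """Run the four stages in order and return the surviving trajectories."""
    # Stage 1: each generator proposes one query per server.
    queries = [gen(server) for server in servers for gen in generators.values()]
    # Stage 2: drop queries that fail the quality filter.
    queries = [q for q in queries if keep_query(q)]
    # Stage 3 (assumption): every teacher answers every query under every framework.
    candidates = [
        Trajectory(q, name, fw, teacher(q, fw))
        for q, (name, teacher), fw in product(queries, teachers.items(), frameworks)
    ]
    # Stage 4: keep only trajectories that pass validation.
    return [t for t in candidates if keep_trajectory(t)]
```

In this sketch the filters are passed in as plain predicates, so a rule-based check and a model-based judge can be combined into one `keep_trajectory` callable without changing the pipeline itself.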

Problem

Research questions and friction points this paper is trying to address.

Addressing the lack of high-quality tool-agentic training data
Generating diverse and realistic multi-tool interaction trajectories
Creating challenging tasks from real-world MCP environments

Innovation

Methods, ideas, or system contributions that make the work stand out.

Synthesized 1.5M tool-agentic trajectories from MCPs
Generated diverse queries using five distinct models
Applied rule-based and model-based validation for quality
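
To make the rule-based half of that validation concrete, here is a minimal sketch of what such a check might look like. The specific rules are assumptions chosen for illustration (the summary does not list the rules Toucan applies), as are the chat-style message layout, the `is_error` flag, and the function name `rule_check`.

```python
from typing import Any

Message = dict[str, Any]  # assumed chat-style message: role, content, optional tool fields


def rule_check(messages: list[Message]) -> list[str]:
    """Return the list of rule violations; an empty list means the trajectory passes."""
    problems: list[str] = []

    # IDs of tool calls issued by the assistant.
    call_ids = {
        call["id"]
        for m in messages
        if m.get("role") == "assistant"
        for call in m.get("tool_calls", [])
    }
    # IDs of tool calls that received a tool response.
    result_ids = {m.get("tool_call_id") for m in messages if m.get("role") == "tool"}

    if not call_ids:
        problems.append("no tool call")
    if call_ids - result_ids:
        problems.append("tool call without a result")
    if any(m.get("role") == "tool" and m.get("is_error") for m in messages):
        problems.append("tool execution error")

    # The trajectory should end with a non-empty assistant answer.
    last = messages[-1] if messages else {}
    if last.get("role") != "assistant" or not str(last.get("content") or "").strip():
        problems.append("missing final answer")
    return problems
```

Returning the violations rather than a bare boolean keeps the reasons for rejection available, which can be useful when auditing how much data each rule removes.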