🤖 AI Summary
This work addresses the scarcity of large-scale, verifiable tool-use datasets grounded in real API behavior. The authors propose a reverse synthesis framework, FireFly, that first guides a large language model to explore APIs on real MCP servers along graph-guided directed acyclic graph (DAG) structures and then generates tasks backward from the observed execution traces, so that labels are correct by construction. A pairwise tool graph and a sub-DAG sampling mechanism scale exploration to a tool space of roughly 1,000 tools, and a retrieval-augmented simulator that caches and replays exploration results mitigates environment drift in live APIs. The pipeline yields 5,144 verified tasks spanning 240 servers and 993 tools. A 4B-parameter model trained with GRPO on this data matches Claude Sonnet 4.6 on the authors' held-out test set and improves on established tool-calling benchmarks including Tau2-Bench, MCPMark, and MCP-Atlas.
📝 Abstract
Training tool-calling agents requires large-scale trajectory data with verifiable labels, yet existing approaches either synthesize environments that diverge from real API behavior or generate tasks without ground-truth outcomes for verification. We present FireFly, a pipeline for generating verified tool-call data from real-world MCP servers. Our key insight is to invert the standard synthesis pipeline: rather than generating tasks and hoping they are solvable, we first let a strong LLM explore real APIs along graph-guided DAG structures, then synthesize tasks backward from observed outcomes, guaranteeing label correctness by construction. To handle the scale of real-world tool spaces (${\sim}$1,000 tools), we build a pairwise tool graph and sample sub-DAGs to focus exploration on semantically coherent workflows. To address environment drift in live APIs, we construct a retrieval-augmented simulator that caches all exploration results and replays them during training and evaluation, enabling fully offline and reproducible RL. Applying this pipeline yields 5,144 verified tasks spanning 240 servers and 993 tools. A 4B-parameter model trained with GRPO on FireFly matches Claude Sonnet 4.6 on our held-out test set and shows improvements on multiple tool-calling benchmarks including Tau2-Bench, MCPMark, and MCP-Atlas.
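The two mechanisms named in the abstract, sub-DAG sampling over a pairwise tool graph and cache replay of exploration results, can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and not FireFly's actual implementation: the tool names, the `sample_sub_dag` and `ReplaySimulator` names, the tree-shaped growth strategy, and the use of string similarity as a stand-in for the retrieval step are all hypothetical.

```python
import json
import random
from difflib import SequenceMatcher

# Hypothetical pairwise tool graph: an edge u -> v means the output of u
# can plausibly feed an input of v. Tool names are invented for illustration.
TOOL_GRAPH = {
    "search_flights": ["get_flight_details", "book_flight"],
    "get_flight_details": ["book_flight"],
    "book_flight": ["send_confirmation"],
    "send_confirmation": [],
}


def sample_sub_dag(graph, start, max_nodes, rng):
    """Grow a small acyclic subgraph from `start` by following outgoing edges.

    Each node is added at most once and only edges into newly added nodes
    are kept, so the result is a tree rooted at `start` (a special case of
    a DAG), even if the full graph contains cycles.
    """
    nodes, edges, frontier = [start], [], [start]
    while frontier and len(nodes) < max_nodes:
        u = frontier.pop(rng.randrange(len(frontier)))
        successors = [v for v in graph[u] if v not in nodes]
        rng.shuffle(successors)
        for v in successors:
            if len(nodes) >= max_nodes:
                break
            nodes.append(v)
            edges.append((u, v))
            frontier.append(v)
    return nodes, edges


class ReplaySimulator:
    """Replays cached tool results instead of calling the live API."""

    def __init__(self, min_similarity=0.8):
        self.cache = {}  # tool name -> list of (serialized args, response)
        self.min_similarity = min_similarity

    @staticmethod
    def _key(args):
        # Sorting keys makes the cache insensitive to argument order.
        return json.dumps(args, sort_keys=True)

    def record(self, tool, args, response):
        """Store a result observed during exploration."""
        self.cache.setdefault(tool, []).append((self._key(args), response))

    def call(self, tool, args):
        """Return the cached response whose arguments best match `args`.

        Assumption: string similarity stands in for the retrieval step;
        calls below the similarity threshold return an error payload.
        """
        query = self._key(args)
        best_score, best_response = 0.0, None
        for key, response in self.cache.get(tool, []):
            score = 1.0 if key == query else SequenceMatcher(None, key, query).ratio()
            if score > best_score:
                best_score, best_response = score, response
        if best_score >= self.min_similarity:
            return best_response
        return {"error": "no cached result for this call"}


# Usage: sample a workflow, record one explored call, then replay it offline.
workflow_nodes, workflow_edges = sample_sub_dag(
    TOOL_GRAPH, "search_flights", max_nodes=3, rng=random.Random(0)
)
simulator = ReplaySimulator()
simulator.record("search_flights", {"origin": "SFO", "dest": "JFK"}, {"flights": ["UA1"]})
replayed = simulator.call("search_flights", {"dest": "JFK", "origin": "SFO"})
```

Because responses come from the cache and never from a live server, the same call returns the same result in every training or evaluation run, which is the property the abstract relies on for offline, reproducible RL.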