Firefly: Illuminating Large-Scale Verified Tool-Call Data Generation from Real APIs

📅 2026-05-17
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work addresses the scarcity of large-scale, verifiable tool-use datasets grounded in real API behavior. To this end, we propose a reverse synthesis framework that first guides a large language model to explore APIs on real MCP servers along graph-guided directed acyclic graph (DAG) structures, and then generates tasks backward from the observed execution traces, so that labels are correct by construction. We introduce a tool-relation graph and a sub-DAG sampling mechanism to scale exploration across large tool spaces, alongside a retrieval-augmented simulator with cache replay to mitigate environment drift. Using this approach, we generate 5,144 verified tasks spanning 240 servers and 993 tools. A 4B-parameter model trained on this data matches the performance of Claude Sonnet 4.6 on a held-out test set and improves on established benchmarks including Tau2-Bench, MCPMark, and MCP-Atlas.
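As a rough illustration of the sub-DAG sampling idea, the sketch below grows a small node set outward from a random seed tool and keeps only edges that point forward in discovery order, which makes the result acyclic by construction. The function name `sample_sub_dag`, the edge format, and the growth rule are all assumptions made for illustration; the summary does not specify how the paper samples sub-DAGs.

```python
import random

def sample_sub_dag(edges, size, seed=0):
    """Sample up to `size` tools and the forward edges among them.

    `edges` is a list of (producer, consumer) tool-name pairs; this
    representation of the tool-relation graph is hypothetical.
    """
    rng = random.Random(seed)
    # Build a successor map that also contains sink-only nodes.
    succ = {}
    for u, v in edges:
        succ.setdefault(u, []).append(v)
        succ.setdefault(v, [])
    # Start from a random tool, then repeatedly add a random successor
    # of any already-chosen tool.
    order = [rng.choice(sorted(succ))]
    while len(order) < size:
        chosen = set(order)
        candidates = sorted({v for u in order for v in succ[u]} - chosen)
        if not candidates:
            break
        order.append(rng.choice(candidates))
    # Keep only edges that go forward in discovery order, so the
    # discovery order is a valid topological order of the result.
    rank = {tool: i for i, tool in enumerate(order)}
    sub_edges = [(u, v) for u, v in edges
                 if u in rank and v in rank and rank[u] < rank[v]]
    return order, sub_edges
```

Because every tool after the first is added as a successor of an earlier one, the sampled nodes stay connected, which loosely mirrors the goal of focusing exploration on coherent workflows.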
📝 Abstract
Training tool-calling agents requires large-scale trajectory data with verifiable labels, yet existing approaches either synthesize environments that diverge from real API behavior or generate tasks without ground-truth outcomes for verification. We present FireFly, a pipeline for generating verified tool-call data from real-world MCP servers. Our key insight is to invert the standard synthesis pipeline: rather than generating tasks and hoping they are solvable, we first let a strong LLM explore real APIs along graph-guided DAG structures, then synthesize tasks backward from observed outcomes, guaranteeing label correctness by construction. To handle the scale of real-world tool spaces (~1,000 tools), we build a pairwise tool graph and sample sub-DAGs to focus exploration on semantically coherent workflows. To address environment drift in live APIs, we construct a retrieval-augmented simulator that caches all exploration results and replays them during training and evaluation, enabling fully offline and reproducible RL. Applying this pipeline yields 5,144 verified tasks spanning 240 servers and 993 tools. A 4B-parameter model trained with GRPO on FireFly matches Claude Sonnet 4.6 on our held-out test set and shows improvements on multiple tool-calling benchmarks including Tau2-Bench, MCPMark, and MCP-Atlas.
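The cache-and-replay idea behind the simulator can be sketched as follows. This is a minimal illustration, not the paper's implementation: the class name `ReplaySimulator`, the key format, and the Jaccard-overlap fallback used for unseen arguments are all assumptions, since the abstract only states that exploration results are cached and replayed with retrieval augmentation.

```python
import json

class ReplaySimulator:
    """Replay cached tool results offline (hypothetical sketch)."""

    def __init__(self):
        # Maps a canonical call key to (tool, args, result).
        self._cache = {}

    @staticmethod
    def _key(tool, args):
        # Sort keys so argument order does not change the cache key.
        return tool + "|" + json.dumps(args, sort_keys=True)

    @staticmethod
    def _tokens(args):
        return {f"{k}={v}" for k, v in args.items()}

    def record(self, tool, args, result):
        """Store a result observed during live exploration."""
        self._cache[self._key(tool, args)] = (tool, args, result)

    def call(self, tool, args):
        """Return the exact cached result, else the closest one for this tool."""
        key = self._key(tool, args)
        if key in self._cache:
            return self._cache[key][2]
        # Assumed fallback: pick the cached call to the same tool whose
        # arguments overlap most (Jaccard similarity over key=value pairs).
        query = self._tokens(args)
        best, best_score = None, 0.0
        for cached_tool, cached_args, result in self._cache.values():
            if cached_tool != tool:
                continue
            other = self._tokens(cached_args)
            union = query | other
            score = len(query & other) / len(union) if union else 0.0
            if score > best_score:
                best, best_score = result, score
        if best is None:
            raise KeyError(f"no cached result for {tool}")
        return best
```

In use, one would call `record` for every tool call made during live exploration and `call` during training or evaluation, so no request reaches the live API. A real system would likely use a stronger retriever than token overlap; the overlap score here only stands in for that step.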
Problem

Research questions and friction points this paper is trying to address.

tool-call data
verified labels
real APIs
trajectory data
ground-truth outcomes
Innovation

Methods, ideas, or system contributions that make the work stand out.

verified tool-call data
graph-guided DAG exploration
retrieval-augmented simulator
backward task synthesis
real-world API
🔎 Similar Papers
2024-09-02 · International Conference on Learning Representations · Citations: 48