ToolForge: A Data Synthesis Pipeline for Multi-Hop Search without Real-World APIs

📅 2025-12-17
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing tool learning approaches rely on real API calls, incurring high computational costs, exhibiting poor generalization, and lacking multi-hop reasoning and self-reflection capabilities. Method: We propose the first real-API-call-free framework for synthesizing multi-hop search tool learning data. Given (question, gold context, answer) triplets, it automatically generates high-quality, diverse training data via lightweight virtual tool modeling. Our method innovatively integrates multi-hop reasoning chain generation with self-reflection enhancement, and establishes a multi-layer verification system—combining rule-based and model-based checks—to ensure data fidelity. Contribution/Results: Experiments demonstrate that an 8B-parameter model trained on our synthetic data surpasses GPT-4o across multiple benchmarks. To foster reproducibility and community advancement, we publicly release both the code and the dataset.

📝 Abstract
Training LLMs to invoke tools and leverage retrieved information necessitates high-quality, diverse data. However, existing pipelines for synthetic data generation often rely on tens of thousands of real API calls to enhance generalization, incurring prohibitive costs while lacking multi-hop reasoning and self-reflection. To address these limitations, we introduce ToolForge, an automated synthesis framework that achieves strong real-world tool-calling performance by constructing only a small number of virtual tools, eliminating the need for real API calls. ToolForge leverages a (question, golden context, answer) triple to synthesize large-scale tool-learning data specifically designed for multi-hop search scenarios, further enriching the generated data through multi-hop reasoning and self-reflection mechanisms. To ensure data fidelity, we employ a Multi-Layer Validation Framework that integrates both rule-based and model-based assessments. Empirical results show that a model with only 8B parameters, when trained on our synthesized data, outperforms GPT-4o on multiple benchmarks. Our code and dataset are publicly available at https://github.com/Buycar-arb/ToolForge.
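The abstract's pipeline (a (question, gold context, answer) triplet in, a multi-hop tool-calling trajectory out) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `Triplet`, `Trajectory`, and `virtual_search` names are hypothetical, and the sub-query and reflection text that an LLM would generate in the real pipeline are stubbed with placeholders.

```python
from dataclasses import dataclass, field

@dataclass
class Triplet:
    """One (question, gold context, answer) source record."""
    question: str
    gold_context: list  # list[str]: one gold passage per reasoning hop
    answer: str

@dataclass
class Trajectory:
    """A synthesized multi-hop tool-learning example (hypothetical schema)."""
    question: str
    steps: list = field(default_factory=list)
    final_answer: str = ""

def virtual_search(query, gold_context, hop):
    # Virtual tool: no real API call is made. A lightweight model of a
    # search tool simply surfaces the gold passage for this hop as the
    # "tool response".
    return gold_context[hop]

def synthesize(triplet: Triplet) -> Trajectory:
    """Expand one triplet into a multi-hop search trajectory."""
    traj = Trajectory(question=triplet.question)
    for hop in range(len(triplet.gold_context)):
        # In the real pipeline an LLM writes the sub-query and a
        # self-reflection note; both are placeholders here.
        query = f"sub-query for hop {hop + 1} of: {triplet.question}"
        traj.steps.append({
            "tool": "virtual_search",
            "query": query,
            "observation": virtual_search(query, triplet.gold_context, hop),
            "reflection": f"observation {hop + 1} supports the next hop",
        })
    traj.final_answer = triplet.answer
    return traj
```

One trajectory is produced per hop-annotated triplet; a real system would then pass each trajectory through the validation layers before adding it to the training set.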
Problem

Research questions and friction points this paper is trying to address.

Synthesizes tool-learning data without real API calls
Enables multi-hop search with self-reflection mechanisms
Trains smaller models to outperform larger ones on benchmarks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Automated synthesis framework eliminates real API calls
Multi-hop reasoning and self-reflection enrich generated data
Multi-Layer Validation ensures data fidelity with rule and model assessments
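The multi-layer validation idea, cheap rule-based checks first and a model-based judge only for survivors, might look like the sketch below. The specific check names and the stubbed `judge` callable are assumptions for illustration; the paper's actual criteria are not reproduced here.

```python
def rule_based_checks(traj: dict) -> bool:
    """Layer 1: cheap structural rules a trajectory must satisfy."""
    steps = traj.get("steps", [])
    if not steps or not traj.get("final_answer"):
        return False
    # Every step needs a non-empty query and observation.
    return all(s.get("query") and s.get("observation") for s in steps)

def model_based_check(traj: dict, judge) -> bool:
    """Layer 2: a model-based judge scores faithfulness. `judge` is any
    callable returning a bool (stubbed here; a real system would call
    an LLM)."""
    return judge(traj)

def validate(traj: dict, judge) -> bool:
    """Run layers in order, so the expensive model check only sees
    trajectories that already pass the rules."""
    return rule_based_checks(traj) and model_based_check(traj, judge)
```

Ordering the layers this way keeps the costly model-based assessment off trajectories that fail trivially, which is the usual motivation for a layered filter.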
🔎 Similar Papers
2024-09-02 · International Conference on Learning Representations · Citations: 48
Hao Chen
North China University of Technology, Meituan
Zhexin Hu
Meituan, Institute of Software, Chinese Academy of Sciences
Jiajun Chai
Meituan Inc.
Reinforcement Learning · LLMs · Agentic Learning
Haocheng Yang
Meituan, National University of Singapore
Hang He
East China Normal University
AI Agent · Reinforcement Learning · VLM · IR · LLM4SE
Xiaohan Wang
Meituan
Wei Lin
Meituan
Luhang Wang
North China University of Technology
Guojun Yin
Meituan, University of Science and Technology of China
Multimodality · Computer Vision · Foundation Models · Deep Learning · Image/Video Processing
Zhuofeng Zhao
North China University of Technology