ToolForge: A Data Synthesis Pipeline for Multi-Hop Search without Real-World APIs

📅 2025-12-17
📈 Citations: 0
Influential: 0
📄 PDF

career value

200K/year
🤖 AI Summary
Existing tool learning approaches rely on real API calls, incurring high computational costs, exhibiting poor generalization, and lacking multi-hop reasoning and self-reflection capabilities. Method: We propose the first real-API-call-free framework for synthesizing multi-hop search tool learning data. Given (question, gold context, answer) triplets, it automatically generates high-quality, diverse training data via lightweight virtual tool modeling. Our method innovatively integrates multi-hop reasoning chain generation with self-reflection enhancement, and establishes a multi-layer verification system—combining rule-based and model-based checks—to ensure data fidelity. Contribution/Results: Experiments demonstrate that an 8B-parameter model trained on our synthetic data surpasses GPT-4o across multiple benchmarks. To foster reproducibility and community advancement, we publicly release both the code and the dataset.

Technology Category

Application Category

📝 Abstract
Training LLMs to invoke tools and leverage retrieved information necessitates high-quality, diverse data. However, existing pipelines for synthetic data generation often rely on tens of thousands of real API calls to enhance generalization, incurring prohibitive costs while lacking multi-hop reasoning and self-reflection. To address these limitations, we introduce ToolForge, an automated synthesis framework that achieves strong real-world tool-calling performance by constructing only a small number of virtual tools, eliminating the need for real API calls. ToolForge leverages a (question, golden context, answer) triple to synthesize large-scale tool-learning data specifically designed for multi-hop search scenarios, further enriching the generated data through multi-hop reasoning and self-reflection mechanisms. To ensure data fidelity, we employ a Multi-Layer Validation Framework that integrates both rule-based and model-based assessments. Empirical results show that a model with only 8B parameters, when trained on our synthesized data, outperforms GPT-4o on multiple benchmarks. Our code and dataset are publicly available at https://github.com/Buycar-arb/ToolForge .
Problem

Research questions and friction points this paper is trying to address.

Synthesizes tool-learning data without real API calls
Enables multi-hop search with self-reflection mechanisms
Trains smaller models to outperform larger ones on benchmarks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Automated synthesis framework eliminates real API calls
Multi-hop reasoning and self-reflection enrich generated data
Multi-Layer Validation ensures data fidelity with rule and model assessments
🔎 Similar Papers
2024-09-02International Conference on Learning RepresentationsCitations: 48