ToolForge: A Data Synthesis Pipeline for Multi-Hop Search without Real-World APIs

📅 2025-12-17
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing tool learning approaches rely on real API calls, incurring high computational costs, exhibiting poor generalization, and lacking multi-hop reasoning and self-reflection capabilities. Method: We propose the first real-API-call-free framework for synthesizing multi-hop search tool learning data. Given (question, gold context, answer) triplets, it automatically generates high-quality, diverse training data via lightweight virtual tool modeling. Our method innovatively integrates multi-hop reasoning chain generation with self-reflection enhancement, and establishes a multi-layer verification system—combining rule-based and model-based checks—to ensure data fidelity. Contribution/Results: Experiments demonstrate that an 8B-parameter model trained on our synthetic data surpasses GPT-4o across multiple benchmarks. To foster reproducibility and community advancement, we publicly release both the code and the dataset.

📝 Abstract
Training LLMs to invoke tools and leverage retrieved information necessitates high-quality, diverse data. However, existing pipelines for synthetic data generation often rely on tens of thousands of real API calls to enhance generalization, incurring prohibitive costs while lacking multi-hop reasoning and self-reflection. To address these limitations, we introduce ToolForge, an automated synthesis framework that achieves strong real-world tool-calling performance by constructing only a small number of virtual tools, eliminating the need for real API calls. ToolForge leverages a (question, golden context, answer) triple to synthesize large-scale tool-learning data specifically designed for multi-hop search scenarios, further enriching the generated data through multi-hop reasoning and self-reflection mechanisms. To ensure data fidelity, we employ a Multi-Layer Validation Framework that integrates both rule-based and model-based assessments. Empirical results show that a model with only 8B parameters, when trained on our synthesized data, outperforms GPT-4o on multiple benchmarks. Our code and dataset are publicly available at https://github.com/Buycar-arb/ToolForge.
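The abstract's pipeline (a (question, gold context, answer) triplet in, a multi-hop tool-calling trajectory out) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `Triplet`, `Trajectory`, and `virtual_search` names are hypothetical, and the sub-query and reflection text that an LLM would generate in the real pipeline are stubbed with placeholders.

```python
from dataclasses import dataclass, field

@dataclass
class Triplet:
    """One (question, gold context, answer) source record."""
    question: str
    gold_context: list  # list[str]: one gold passage per reasoning hop
    answer: str

@dataclass
class Trajectory:
    """A synthesized multi-hop tool-learning example (hypothetical schema)."""
    question: str
    steps: list = field(default_factory=list)
    final_answer: str = ""

def virtual_search(query, gold_context, hop):
    # Virtual tool: no real API call is made. A lightweight model of a
    # search tool simply surfaces the gold passage for this hop as the
    # "tool response".
    return gold_context[hop]

def synthesize(triplet: Triplet) -> Trajectory:
    """Expand one triplet into a multi-hop search trajectory."""
    traj = Trajectory(question=triplet.question)
    for hop in range(len(triplet.gold_context)):
        # In the real pipeline an LLM writes the sub-query and a
        # self-reflection note; both are placeholders here.
        query = f"sub-query for hop {hop + 1} of: {triplet.question}"
        traj.steps.append({
            "tool": "virtual_search",
            "query": query,
            "observation": virtual_search(query, triplet.gold_context, hop),
            "reflection": f"observation {hop + 1} supports the next hop",
        })
    traj.final_answer = triplet.answer
    return traj
```

One trajectory is produced per hop-annotated triplet; a real system would then pass each trajectory through the validation layers before adding it to the training set.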
Problem

Research questions and friction points this paper is trying to address.

Synthesizes tool-learning data without real API calls
Enables multi-hop search with self-reflection mechanisms
Trains smaller models to outperform larger ones on benchmarks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Automated synthesis framework eliminates real API calls
Multi-hop reasoning and self-reflection enrich generated data
Multi-Layer Validation ensures data fidelity with rule and model assessments
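The multi-layer validation idea, cheap rule-based checks first and a model-based judge only for survivors, might look like the sketch below. The specific check names and the stubbed `judge` callable are assumptions for illustration; the paper's actual criteria are not reproduced here.

```python
def rule_based_checks(traj: dict) -> bool:
    """Layer 1: cheap structural rules a trajectory must satisfy."""
    steps = traj.get("steps", [])
    if not steps or not traj.get("final_answer"):
        return False
    # Every step needs a non-empty query and observation.
    return all(s.get("query") and s.get("observation") for s in steps)

def model_based_check(traj: dict, judge) -> bool:
    """Layer 2: a model-based judge scores faithfulness. `judge` is any
    callable returning a bool (stubbed here; a real system would call
    an LLM)."""
    return judge(traj)

def validate(traj: dict, judge) -> bool:
    """Run layers in order, so the expensive model check only sees
    trajectories that already pass the rules."""
    return rule_based_checks(traj) and model_based_check(traj, judge)
```

Ordering the layers this way keeps the costly model-based assessment off trajectories that fail trivially, which is the usual motivation for a layered filter.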
🔎 Similar Papers
2024-09-02 · International Conference on Learning Representations · Citations: 48
Hao Chen
North China University of Technology, Meituan
Zhexin Hu
Meituan, Institute of Software, Chinese Academy of Sciences
Jiajun Chai
Meituan Inc.
Reinforcement Learning · LLMs · Agentic Learning
Haocheng Yang
Meituan, National University of Singapore
Hang He
East China Normal University
AI Agent · Reinforcement Learning · VLM · IR · LLM4SE
Xiaohan Wang
Meituan
Wei Lin
Meituan
Luhang Wang
North China University of Technology
Guojun Yin
Meituan, University of Science and Technology of China
Multimodality · Computer Vision · Foundation Models · Deep Learning · Image/Video Processing
Zhuofeng Zhao
North China University of Technology