🤖 AI Summary
This work addresses the limitation of current large language models in autonomous tool use, which stems from a scarcity of diverse and realistic multi-turn tool interaction data. The authors propose a novel paradigm that automatically synthesizes multi-turn tool-use trajectories from general-purpose text corpora, treating natural text as a scalable source of behavioral traces for the first time. Their approach employs a four-stage pipeline—comprising relevance filtering, workflow and tool extraction, trajectory embodiment, and complexity optimization—alongside a dedicated trajectory synthesis model fine-tuned with supervised learning to enable efficient and generalizable data generation. Evaluated on the BFCL V3 multi-turn benchmark, the resulting GEM-32B model achieves a 16.5% performance gain, surpassing certain models trained on domain-specific τ-bench data while significantly reducing inference latency and computational cost.
📝 Abstract
Enabling Large Language Models (LLMs) to effectively utilize tools in multi-turn interactions is essential for building capable autonomous agents. However, acquiring diverse and realistic multi-turn tool-use data remains a significant challenge. In this work, we propose a novel text-based paradigm. We observe that textual corpora naturally contain rich, multi-step problem-solving experiences, which can serve as an untapped, scalable, and authentic data source for multi-turn tool-use tasks. Based on this insight, we introduce GEM, a data synthesis pipeline that enables the generation and extraction of multi-turn tool-use trajectories from text corpora through a four-stage process: relevance filtering, workflow&tool extraction, trajectory grounding, and complexity refinement. To reduce the computational cost, we further train a specialized Trajectory Synthesizer via supervised fine-tuning. This model distills the complex generation pipeline into an efficient, end-to-end trajectory generator. Experiments demonstrate that our GEM-32B achieve a 16.5% improvement on the BFCL V3 Multi-turn benchmark. Our models partially surpass the performance of models trained on {\tau} - bench (Airline and Retail) in-domain data, highlighting the superior generalization capability derived from our text-based synthesis paradigm. Notably, our Trajectory Synthesizer matches the quality of the full pipeline while significantly reducing inference latency and costs.