🤖 AI Summary
This work addresses three challenges in tool-augmented large language models: inconsistent interaction representations, neglect of the structural distribution of tool-use trajectories, and fragmented evaluation benchmarks. The authors propose UniToolCall, a unified framework built on a standardized Query-Action-Observation-Answer (QAOA) representation that supports fine-grained evaluation at the function-call, turn, and conversation levels. They also introduce an Anchor Linkage mechanism to model cross-turn dependencies. The training corpus contains over 390k synthetic and real trajectories drawn from a pool of more than 22k tools, and covers single-hop and multi-hop, single-turn and multi-turn, and serial and parallel structures; the framework employs structure-controlled trajectory augmentation for training. Fine-tuning Qwen3-8B on this corpus yields 93.0% single-turn Strict Precision under the distractor-heavy Hybrid-20 setting, outperforming commercial models such as GPT, Gemini, and Claude.
📝 Abstract
Tool-use capability is a fundamental component of LLM agents, enabling them to interact with external systems through structured function calls. However, existing research exhibits inconsistent interaction representations, largely overlooks the structural distribution of tool-use trajectories, and relies on incompatible evaluation benchmarks. We present UniToolCall, a unified framework for tool learning that standardizes the entire pipeline from toolset construction and dataset generation to evaluation. The framework curates a large tool pool of 22k+ tools and constructs a hybrid training corpus of 390k+ instances by combining 10 standardized public datasets with structurally controlled synthetic trajectories. It explicitly models diverse interaction patterns, including single-hop vs. multi-hop and single-turn vs. multi-turn, while capturing both serial and parallel execution structures. To support coherent multi-turn reasoning, we further introduce an Anchor Linkage mechanism that enforces cross-turn dependencies. Furthermore, we convert 7 public benchmarks into a unified Query-Action-Observation-Answer (QAOA) representation with fine-grained evaluation at the function-call, turn, and conversation levels. Experiments show that fine-tuning Qwen3-8B on our dataset substantially improves tool-use performance. Under the distractor-heavy Hybrid-20 setting, the fine-tuned model achieves 93.0% single-turn Strict Precision, outperforming commercial models including GPT, Gemini, and Claude.
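To make the QAOA idea concrete, the following is a minimal Python sketch of how one such record and its three evaluation levels might look. The abstract does not give the actual schema or the definition of Strict Precision, so everything here is an assumption made for illustration: the class names `Action` and `Turn`, the field names, the `get_weather` and `get_time` tools, and the all-or-nothing matching rule (same tool name, same arguments, same number of calls per hop, with parallel calls compared without regard to order).

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Action:
    """One function call: a tool name plus keyword arguments (hypothetical schema)."""
    name: str
    arguments: dict[str, Any] = field(default_factory=dict)


@dataclass
class Turn:
    """One Query-Action-Observation-Answer unit (field names are assumptions)."""
    query: str
    hops: list[list[Action]]       # serial hops; calls inside one hop run in parallel
    observations: list[list[str]]  # tool outputs, aligned with `hops`
    answer: str


def call_matches(pred: Action, gold: Action) -> bool:
    """Function-call level: tool name and arguments must both be identical."""
    return pred.name == gold.name and pred.arguments == gold.arguments


def turn_matches(pred: Turn, gold: Turn) -> bool:
    """Turn level: same hops in the same order; calls within a hop in any order."""
    if len(pred.hops) != len(gold.hops):
        return False
    for pred_hop, gold_hop in zip(pred.hops, gold.hops):
        if len(pred_hop) != len(gold_hop):
            return False
        unmatched = list(gold_hop)
        for call in pred_hop:
            hit = next((g for g in unmatched if call_matches(call, g)), None)
            if hit is None:
                return False
            unmatched.remove(hit)
    return True


def conversation_matches(pred: list[Turn], gold: list[Turn]) -> bool:
    """Conversation level: every turn must match, in order."""
    return len(pred) == len(gold) and all(
        turn_matches(p, g) for p, g in zip(pred, gold)
    )


# Hypothetical example: one turn with a single hop of two parallel calls.
weather = Action("get_weather", {"city": "Paris"})
clock = Action("get_time", {"tz": "Europe/Paris"})
gold_turn = Turn("Weather and time in Paris?", [[weather, clock]],
                 [["sunny", "14:00"]], "It is sunny and 14:00.")
swapped_turn = Turn(gold_turn.query, [[clock, weather]],
                    [["14:00", "sunny"]], gold_turn.answer)
wrong_turn = Turn(gold_turn.query,
                  [[Action("get_weather", {"city": "Lyon"}), clock]],
                  [["rainy", "14:00"]], "It is rainy and 14:00.")
```

Under these assumptions, `swapped_turn` counts as a match because its two calls are parallel, while `wrong_turn` fails at the function-call level and therefore also at the turn and conversation levels.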