🤖 AI Summary
This paper identifies four critical challenges that hinder tool learning in large language models (LLMs): limitations of existing benchmarks, which in turn induce both the neglect and the lack of autonomous learning, generalization, and long-horizon task-solving capabilities. To support this analysis, the authors introduce ToLeaP, an open-source tool learning platform that reproduces 33 benchmarks, provides one-click evaluation for seven of them, and collects 21 candidate training datasets; using ToLeaP, they evaluate 41 prevalent LLMs and analyze over 3,000 bad cases. To guide future work, the paper explores four potential directions: real-world benchmark construction, compatibility-aware autonomous learning, rationale learning by thinking, and identifying and recalling key clues. Preliminary experiments demonstrate the effectiveness of these directions and highlight the need for further research.
📝 Abstract
Tool learning, which enables large language models (LLMs) to utilize external tools effectively, has garnered increasing attention for its potential to revolutionize productivity across industries. Despite rapid progress in tool learning, key challenges and opportunities remain understudied, limiting deeper insights and future advancements. In this paper, we investigate the tool learning ability of 41 prevalent LLMs by reproducing 33 benchmarks and enabling one-click evaluation for seven of them, forming a Tool Learning Platform named ToLeaP. We also collect 21 of the 33 potential training datasets to facilitate future exploration. After analyzing over 3,000 bad cases of the 41 LLMs on ToLeaP, we identify four critical challenges: (1) benchmark limitations, which induce both the neglect and the lack of (2) autonomous learning, (3) generalization, and (4) long-horizon task-solving capabilities in LLMs. To aid future advancements, we take a step further and explore four potential directions: (1) real-world benchmark construction, (2) compatibility-aware autonomous learning, (3) rationale learning by thinking, and (4) identifying and recalling key clues. Preliminary experiments demonstrate their effectiveness, highlighting the need for further research and exploration.
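The workflow the abstract describes, running one model across many reproduced benchmarks and collecting its bad cases for later analysis, can be sketched as follows. This is a minimal illustration only: the function names (`evaluate`, `run_all`) and the benchmark/record fields (`cases`, `query`, `expected_tool`) are hypothetical placeholders, not ToLeaP's actual API.

```python
def evaluate(model, benchmark):
    """Run a model on one benchmark; return per-case pass/fail records."""
    results = []
    for case in benchmark["cases"]:
        predicted = model(case["query"])  # model selects a tool for the query
        results.append({
            "case": case,
            "passed": predicted == case["expected_tool"],
        })
    return results


def run_all(model, benchmarks):
    """Aggregate per-benchmark success rates and collect bad cases
    (failed cases) for downstream challenge analysis."""
    bad_cases, scores = [], {}
    for name, bench in benchmarks.items():
        results = evaluate(model, bench)
        scores[name] = sum(r["passed"] for r in results) / len(results)
        bad_cases += [r["case"] for r in results if not r["passed"]]
    return scores, bad_cases
```

In this shape, a "one-click" run is a single call to `run_all`, and the returned `bad_cases` pool is what an analysis like the paper's 3,000-bad-case study would be drawn from.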