ToLeaP: Rethinking Development of Tool Learning with Large Language Models

📅 2025-05-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper investigates the tool learning ability of 41 prevalent large language models (LLMs) by reproducing 33 benchmarks within ToLeaP, an open-source Tool Learning Platform that supports one-click evaluation for seven of them. An analysis of over 3,000 bad cases surfaces four critical challenges: benchmark limitations, together with weak autonomous learning, poor cross-task generalization, and limited long-horizon task-solving capabilities in LLMs. To aid future advancement, the authors explore four directions: real-world benchmark construction, compatibility-aware autonomous learning, rationale learning by thinking, and identifying and recalling key clues. ToLeaP also releases 21 collected training datasets. Preliminary experiments demonstrate the effectiveness of these directions, highlighting the need for further research on tool-augmented LLMs.

📝 Abstract
Tool learning, which enables large language models (LLMs) to utilize external tools effectively, has garnered increasing attention for its potential to revolutionize productivity across industries. Despite rapid development in tool learning, key challenges and opportunities remain understudied, limiting deeper insights and future advancements. In this paper, we investigate the tool learning ability of 41 prevalent LLMs by reproducing 33 benchmarks and enabling one-click evaluation for seven of them, forming a Tool Learning Platform named ToLeaP. We also collect 21 out of 33 potential training datasets to facilitate future exploration. After analyzing over 3,000 bad cases of 41 LLMs based on ToLeaP, we identify four main critical challenges: (1) benchmark limitations induce both the neglect and lack of (2) autonomous learning, (3) generalization, and (4) long-horizon task-solving capabilities of LLMs. To aid future advancements, we take a step further toward exploring potential directions, namely (1) real-world benchmark construction, (2) compatibility-aware autonomous learning, (3) rationale learning by thinking, and (4) identifying and recalling key clues. The preliminary experiments demonstrate their effectiveness, highlighting the need for further research and exploration.
Problem

Research questions and friction points this paper is trying to address.

Investigates the tool learning challenges of 41 prevalent LLMs using 33 reproduced benchmarks
Identifies four key limitations, including weak autonomous learning, generalization, and long-horizon task-solving in LLMs
Proposes future directions such as real-world benchmark construction and rationale learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Developed ToLeaP, a platform enabling one-click evaluation of LLM tool learning
Collected 21 training datasets to facilitate future tool learning research
Identified four key challenges in LLM tool learning through analysis of over 3,000 bad cases