🤖 AI Summary
This work addresses a key limitation of large language model (LLM) agents: tool invocation depends heavily on the quality of human-written tool interface descriptions and degrades sharply in cold-start scenarios, where candidate tools are numerous or execution traces are absent. To overcome this, the authors propose Trace-Free+, a framework that uses curriculum learning to transfer supervised knowledge from trace-rich environments to trace-free deployment settings. The curriculum guides the model to learn reusable tool-use patterns and to automatically refine tool descriptions without relying on execution trajectories. Trace-Free+ supports cross-tool generalization and scales effectively to tool sets comprising hundreds of functions. Experiments on StableToolBench and RestBench show that Trace-Free+ substantially improves invocation accuracy on unseen tools, exhibiting strong cross-domain generalization and robustness at scale.
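The page gives no implementation details, but as a rough intuition for the curriculum transfer, a stage-wise schedule could look like the minimal sketch below; the linear decay, batch size, and the `trace_rich`/`trace_free` example pools are illustrative assumptions, not details from the paper.

```python
import random

def curriculum_batch(trace_rich, trace_free, stage, n_stages, batch_size=32):
    """Sample a training batch whose trace-supervised fraction decays
    linearly from 1.0 (first stage) to 0.0 (final, trace-free stage)."""
    p_trace = 1.0 - stage / max(1, n_stages - 1)
    return [
        random.choice(trace_rich if random.random() < p_trace else trace_free)
        for _ in range(batch_size)
    ]
```

Under this assumed schedule, early stages train almost entirely on trace-supervised examples, while the final stage sees only trace-free ones, matching the deployment condition.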
📝 Abstract
The performance of LLM-based agents depends not only on the agent itself but also on the quality of the tool interfaces it consumes. While prior work has focused heavily on agent fine-tuning, tool interfaces (including natural language descriptions and parameter schemas) remain largely human-oriented and often become a bottleneck, especially when agents must select from large candidate tool sets. Existing approaches to improving tool interfaces rely on execution traces, which are frequently unavailable in cold-start or privacy-constrained settings, and typically optimize each tool independently, limiting scalability and generalization to unseen tools. We propose Trace-Free+, a curriculum learning framework that progressively transfers supervision from trace-rich settings to trace-free deployment, encouraging the model to abstract reusable interface-usage patterns and tool-usage outcomes. To support this approach, we construct a large-scale dataset of high-quality tool interfaces using a structured workflow over a diverse collection of tools. Experiments on StableToolBench and RestBench show consistent gains on unseen tools, strong cross-domain generalization, and robustness as the number of candidate tools scales beyond 100, demonstrating that tool interface optimization is a practical and deployable complement to agent fine-tuning.
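As a concrete illustration of what trace-free interface refinement could mean in practice, the hypothetical sketch below rewrites a tool's description from its static schema alone, with no execution traces; `ask_llm`, the prompt wording, and the dict layout are assumptions for illustration, not the paper's actual workflow.

```python
def refine_interface(tool: dict, ask_llm) -> dict:
    """Rewrite a human-oriented tool description into an agent-oriented one
    using only the static interface (no execution traces)."""
    prompt = (
        "Rewrite this tool description for an LLM agent. State when the tool "
        "should be called, what each parameter means, and the shape of the "
        "result. Do not invent behavior beyond the given schema.\n\n"
        f"Name: {tool['name']}\n"
        f"Description: {tool['description']}\n"
        f"Parameters: {tool['parameters']}"
    )
    refined = dict(tool)
    refined["description"] = ask_llm(prompt)  # any chat-completion callable
    return refined
```

Because the rewrite consumes only the schema, a step like this could run per tool at registration time, which is consistent with the cold-start and privacy-constrained settings the abstract targets.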