🤖 AI Summary
This work addresses the challenges of multi-step task execution in large-scale tool libraries: planning complexity, the absence of effective evaluation frameworks, and high computational overhead. To this end, the authors introduce SLATE (Synthetic Large-scale API Toolkit for E-commerce), a benchmark for evaluating tool-augmented agents that accepts diverse yet functionally valid execution trajectories rather than a single reference path. They further propose Entropy-Guided Branching (EGB), a search algorithm that expands additional decision branches where predictive uncertainty is high, adaptively balancing exploration and exploitation. Experiments in SLATE's synthetic e-commerce API environment indicate that the approach improves both task success rates and computational efficiency, supporting its scalability and robustness in tool-intensive scenarios.
📝 Abstract
Large Language Models (LLMs) have significantly advanced tool-augmented agents, enabling autonomous reasoning via API interactions. However, executing multi-step tasks within massive tool libraries remains challenging due to two critical bottlenecks: (1) the absence of rigorous, plan-level evaluation frameworks and (2) the computational demand of exploring vast decision spaces stemming from large toolsets and long-horizon planning. To bridge these gaps, we first introduce SLATE (Synthetic Large-scale API Toolkit for E-commerce), a large-scale context-aware benchmark designed for the automated assessment of tool-integrated agents. Unlike static metrics, SLATE accommodates diverse yet functionally valid execution trajectories, revealing that current agents struggle with self-correction and search efficiency. Motivated by these findings, we next propose Entropy-Guided Branching (EGB), an uncertainty-aware search algorithm that dynamically expands decision branches where predictive entropy is high. EGB optimizes the exploration-exploitation trade-off, significantly enhancing both task success rates and computational efficiency. Extensive experiments on SLATE demonstrate that our dual contribution provides a robust foundation for developing reliable and scalable LLM agents in tool-rich environments.
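The branching rule behind EGB can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract only states that branches are expanded where predictive entropy is high, so the entropy threshold, the branch cap, the function names, and the tool names below are all hypothetical choices made for illustration. The sketch assumes the agent exposes a probability for each candidate tool call at a given step.

```python
import math
from typing import Dict, List


def predictive_entropy(probs: Dict[str, float]) -> float:
    """Shannon entropy (in nats) of a distribution over candidate tool calls."""
    total = sum(probs.values())
    if total <= 0:
        raise ValueError("probabilities must sum to a positive value")
    # Normalize defensively so unnormalized scores are also accepted.
    return -sum((p / total) * math.log(p / total) for p in probs.values() if p > 0)


def choose_branches(
    probs: Dict[str, float],
    threshold: float = 0.5,  # hypothetical entropy cutoff, not from the paper
    max_branches: int = 3,   # hypothetical cap on expansions per step
) -> List[str]:
    """Pick which candidate tool calls to expand at one planning step.

    Low entropy: the model is confident, so follow only the top candidate.
    High entropy: the model is uncertain, so expand several candidates.
    """
    ranked = sorted(probs, key=probs.get, reverse=True)
    if predictive_entropy(probs) < threshold:
        return ranked[:1]
    return ranked[:max_branches]


# Hypothetical e-commerce tool names, used only for illustration.
confident_step = {"search_products": 0.95, "get_cart": 0.03, "apply_coupon": 0.02}
uncertain_step = {"search_products": 0.40, "get_cart": 0.35, "apply_coupon": 0.25}

print(choose_branches(confident_step))
print(choose_branches(uncertain_step))
```

Under these assumptions, the confident step yields a single branch and the uncertain step yields three, so extra search effort is spent only where the model's next-step distribution is spread out. A full search procedure would also need to score and prune the resulting trajectories, which this sketch leaves out.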