🤖 AI Summary
Current large language models struggle in open-world settings that involve extensive tool libraries, long-horizon tasks, complex constraints, and unreliable tool states; moreover, scalable and realistic evaluation environments for such settings are lacking. This work proposes the first scalable open-world tool-use benchmark, integrating 5,571 tools, standardized in a unified format and spanning 204 commonly used applications. A task-generation engine coupled with a state controller synthesizes multi-tool, long-horizon tasks and injects perturbations to evaluate agent robustness. Using a framework that decouples planning from execution, the study reveals a mismatch between tool-planning and execution abilities in existing models. Experiments show that current models fall significantly short in constraint adherence and robustness, with DeepSeek-v3.2 emerging as the strongest performer. Fine-tuning on 1,170 trajectories collected from this environment outperforms baseline methods trained on 119k samples.
📝 Abstract
Tool-using LLM agents still struggle in open-world settings with large tool pools, long-horizon objectives, wild constraints, and unreliable tool states. For scalable and realistic training and testing, we introduce an open-world tool-using environment built on 5,571 format-unified tools across 204 commonly used apps. It includes a task-creation engine that synthesizes long-horizon, multi-tool workflows with wild constraints, and a state controller that injects interruptions and failures to stress-test robustness. On top of this environment, we develop a tool select-then-execute agent framework with a planner-actor decomposition, separating deliberate reasoning and self-correction from step-wise execution. Comprehensive evaluation of state-of-the-art LLMs reveals a misalignment between tool-planning and execution abilities, widespread weakness in constraint following, and that DeepSeek-v3.2 is the most robust model tested. Finally, we collect 1,170 trajectories from our environment to fine-tune LLMs, outperforming baselines trained on 119k samples and indicating the environment's value as both a realistic benchmark and a data engine for tool-using agents. Our code and data will be publicly released.
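The planner-actor decomposition and the state controller described above can be sketched in miniature. This is an illustrative toy, not the paper's implementation: the tool names, plan format, failure-injection rate, and retry policy are all assumptions made for the example.

```python
import random

# Toy tool library standing in for the environment's unified tool format.
# (Hypothetical tools; the real benchmark spans 5,571 tools across 204 apps.)
TOOLS = {
    "search_flights": lambda q: f"flights for {q}",
    "book_hotel": lambda q: f"hotel booked: {q}",
}

def planner(task):
    """Deliberate stage: map a task to an ordered list of tool calls."""
    return [("search_flights", task), ("book_hotel", task)]

def state_controller(step_result, failure_rate=0.3, rng=random):
    """Perturbation layer: randomly turn a tool result into a failure,
    simulating the interruptions and unreliable tool states the
    environment injects to stress-test robustness."""
    if rng.random() < failure_rate:
        return None  # simulated tool outage / interruption
    return step_result

def actor(plan, max_retries=3, rng=random):
    """Step-wise execution with self-correction: retry failed calls."""
    transcript = []
    for tool_name, arg in plan:
        for _attempt in range(max_retries):
            result = state_controller(TOOLS[tool_name](arg), rng=rng)
            if result is not None:
                transcript.append((tool_name, result))
                break
        else:
            transcript.append((tool_name, "FAILED"))
    return transcript

if __name__ == "__main__":
    random.seed(0)
    for tool, outcome in actor(planner("Paris, 2 nights")):
        print(tool, "->", outcome)
```

Separating the planner from the actor, as here, is what lets planning quality and execution robustness be measured independently, which is how the misalignment between the two abilities can surface in evaluation.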