🤖 AI Summary
Existing LLM-agent reinforcement learning methods rely on static tool sets, limiting flexibility and generalization in long-horizon reasoning over dynamically evolving tool environments. To address this, we propose a dynamic multi-step tool-selection mechanism that identifies, ranks, and integrates previously unseen tools in real time during inference. We construct the first large-scale tool-selection dataset with explicit reasoning, comprising 200K samples, 1,000+ tools, and 100+ tasks. We further design a framework that combines supervised fine-tuning with RL-based trajectory stabilization, incorporates a KL-regularized Plackett–Luce ranking model, and achieves dual-modality co-adaptation for Qwen3-8B and Qwen2.5-VL-7B. Our approach yields average improvements of +6.4% on mathematical and scientific reasoning, +4.5% on search-based QA, +7.7% on code generation, and +6.9% on multimodal understanding across ten benchmarks, while using fewer parameters and demonstrating superior zero-shot tool generalization.
📝 Abstract
Agentic reinforcement learning has advanced large language models (LLMs) to reason through long chain-of-thought trajectories while interleaving external tool use. Existing approaches assume a fixed inventory of tools, limiting LLM agents' adaptability to new or evolving toolsets. We present AutoTool, a framework that equips LLM agents with dynamic tool-selection capabilities throughout their reasoning trajectories. We first construct a 200K-sample dataset with explicit tool-selection rationales across 1,000+ tools and 100+ tasks spanning mathematics, science, code generation, and multimodal reasoning. Building on this data foundation, AutoTool employs a dual-phase optimization pipeline: (i) supervised and RL-based trajectory stabilization for coherent reasoning, and (ii) KL-regularized Plackett–Luce ranking to refine consistent multi-step tool selection. Across ten diverse benchmarks, we train two base models, Qwen3-8B and Qwen2.5-VL-7B, with AutoTool. With fewer parameters, AutoTool consistently outperforms advanced LLM agents and tool-integration methods, yielding average gains of 6.4% in math & science reasoning, 4.5% in search-based QA, 7.7% in code generation, and 6.9% in multimodal understanding. In addition, AutoTool exhibits stronger generalization by dynamically leveraging unseen tools from evolving toolsets during inference.
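The abstract names KL-regularized Plackett–Luce ranking as the objective for multi-step tool selection but does not spell out the loss. The sketch below is a minimal, hedged illustration of that building block, not AutoTool's actual implementation: the function names, the softmax reference distribution, and the regularization weight `beta` are all assumptions. Under the Plackett–Luce model, the probability of an observed tool ranking is a product of softmax terms over the tools still remaining at each step, and the KL term keeps the learned tool scores close to a reference policy's scores.

```python
import math


def plackett_luce_nll(scores, ranking):
    """Negative log-likelihood of a ranking under the Plackett-Luce model.

    scores  : per-tool utility scores (list of floats).
    ranking : tool indices ordered best-first.
    At step i, the chosen tool competes only against tools not yet ranked.
    """
    nll = 0.0
    for i in range(len(ranking)):
        tail = ranking[i:]  # tools still available at this step
        log_z = math.log(sum(math.exp(scores[j]) for j in tail))
        nll += log_z - scores[ranking[i]]
    return nll


def softmax(s):
    m = max(s)  # subtract max for numerical stability
    e = [math.exp(x - m) for x in s]
    z = sum(e)
    return [x / z for x in e]


def kl_penalty(scores, ref_scores):
    """KL(softmax(scores) || softmax(ref_scores)) over the tool set."""
    p, q = softmax(scores), softmax(ref_scores)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def kl_regularized_pl_loss(scores, ranking, ref_scores, beta=0.1):
    """Ranking loss plus a KL anchor to a reference policy (beta is hypothetical)."""
    return plackett_luce_nll(scores, ranking) + beta * kl_penalty(scores, ref_scores)
```

For two tools with scores `[1.0, 0.0]` and observed ranking `[0, 1]`, the loss reduces to `log(1 + e^{-1})`, and the KL term vanishes when the learned and reference scores coincide; in practice the same quantities would be computed with autograd tensors so the ranking gradient can flow back into the agent's scoring head.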