OpaqueToolsBench: Learning Nuances of Tool Behavior Through Interaction

📅 2026-02-16
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the challenges large language models face when using poorly documented and behaviorally opaque tools, where existing automated documentation methods are either costly or unreliable. The authors propose ToolObserver, a framework that dynamically refines tool documentation by iteratively analyzing execution feedback from tool invocation traces, combined with a lightweight exploration strategy for efficient, low-cost tool behavior learning. To evaluate this approach, they introduce OpaqueToolsBench, a benchmark spanning three task categories: function calling, interactive chess playing, and long-horizon agentic search. Experimental results show that ToolObserver significantly outperforms baseline methods in complex scenarios while consuming 3.5–7.5× fewer total tokens during test-time tool exploration.

📝 Abstract
Tool-calling is essential for Large Language Model (LLM) agents to complete real-world tasks. While most existing benchmarks assume simple, perfectly documented tools, real-world tools (e.g., general "search" APIs) are often opaque, lacking clear best practices or failure modes. Can LLM agents improve their performance in environments with opaque tools by interacting and subsequently improving documentation? To study this, we create OpaqueToolsBench, a benchmark consisting of three distinct task-oriented environments: general function calling, interactive chess playing, and long-trajectory agentic search. Each environment provides underspecified tools that models must learn to use effectively to complete the task. Results on OpaqueToolsBench suggest existing methods for automatically documenting tools are expensive and unreliable when tools are opaque. To address this, we propose a simple framework, ToolObserver, that iteratively refines tool documentation by observing execution feedback from tool-calling trajectories. Our approach outperforms existing methods on OpaqueToolsBench across datasets, even in relatively hard settings. Furthermore, for test-time tool exploration settings, our method is also efficient, consuming 3.5-7.5x fewer total tokens than the best baseline.
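The abstract describes an iterative loop: call the opaque tool, observe execution feedback, and fold what was learned back into the tool's documentation until it stabilizes. A minimal sketch of that loop is below; the helper names (`run_tool`, `revise_doc`) and the toy lowercase-query failure mode are illustrative assumptions, not details from the paper.

```python
def run_tool(call: str) -> str:
    """Stand-in for invoking an opaque tool; returns execution feedback.
    Here the hidden quirk (an assumption for illustration) is that queries
    must be lowercase."""
    return "ok" if call == call.lower() else "error: query must be lowercase"


def revise_doc(doc: str, feedback: str) -> str:
    """Stand-in for an LLM step that folds observed feedback into the docs."""
    if "lowercase" in feedback and "lowercase" not in doc:
        doc += " Note: queries must be lowercase."
    return doc


def refine_documentation(doc: str, trial_calls: list[str], max_rounds: int = 3) -> str:
    """Iteratively refine tool documentation from tool-calling feedback."""
    for _ in range(max_rounds):
        feedback = "; ".join(run_tool(c) for c in trial_calls)
        new_doc = revise_doc(doc, feedback)
        if new_doc == doc:  # documentation has converged; stop exploring
            break
        doc = new_doc
    return doc
```

In this toy run, a single failed call (`"Search(FOO)"`) surfaces the hidden lowercase constraint, and the loop terminates once a refinement round leaves the documentation unchanged.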
Problem

Research questions and friction points this paper is trying to address.

opaque tools
tool-calling
LLM agents
tool documentation
real-world tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

opaque tools
tool documentation
iterative refinement
LLM agents
execution feedback