🤖 AI Summary
Large language models (LLMs) acting as tool-using agents in multi-turn dynamic environments such as τ-bench often exhibit inconsistent reasoning, drift from domain-specific policies, and incorrect extraction of information over long horizons of tool calls and conversation. Guided by a manual analysis of common errors in conversation trajectories, the authors propose the Input-Reformulation Multi-Agent (IRMA) framework, which automatically reformulates each user query to include the relevant domain rules and tool-call suggestions, giving the tool-calling agent a focused, self-contained input at every turn. On τ-bench, IRMA outperforms ReAct, function calling, and self-reflection baselines by 16.1%, 12.7%, and 19.1%, respectively, in overall pass^5 scores, indicating substantially better reliability and consistency in complex interactive tool use.
📝 Abstract
Recent advances in the reasoning and planning capabilities of large language models (LLMs) have enabled their potential as autonomous agents capable of tool use in dynamic environments. However, in multi-turn conversational environments like τ-bench, these agents often struggle with consistent reasoning, adherence to domain-specific policies, and extracting correct information over a long horizon of tool calls and conversation. To capture and mitigate these failures, we conduct a comprehensive manual analysis of the common errors occurring in the conversation trajectories. We then experiment with reformulations of inputs to the tool-calling agent to improve agent decision-making. Finally, we propose the Input-Reformulation Multi-Agent (IRMA) framework, which automatically reformulates user queries augmented with relevant domain rules and tool suggestions for the tool-calling agent to focus on. The results show that IRMA significantly outperforms ReAct, Function Calling, and Self-Reflection by 16.1%, 12.7%, and 19.1%, respectively, in overall pass^5 scores. These findings highlight the superior reliability and consistency of IRMA compared to other methods in dynamic environments.
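To make the input-reformulation idea concrete, here is a minimal sketch of a reformulator that augments a raw user query with matching domain rules and tool-call suggestions before it reaches the tool-calling agent. This is not the paper's implementation: the rule/tool tables, the keyword-matching retrieval, and the prompt layout are all hypothetical illustrations of the general mechanism.

```python
# Hypothetical sketch of IRMA-style input reformulation (illustrative only).
# A reformulation step rewrites the user turn so the tool-calling agent sees
# the query together with the domain rules and tool suggestions it should
# focus on, instead of having to recall them from a long conversation history.

# Toy domain knowledge, keyed by keyword (hypothetical; a real system would
# use a learned or retrieval-based matcher over the actual policy document).
DOMAIN_RULES = {
    "refund": "Refunds are only allowed within 30 days of purchase.",
    "exchange": "Exchanges require the original order ID.",
}
TOOL_SUGGESTIONS = {
    "refund": ["lookup_order", "process_refund"],
    "exchange": ["lookup_order", "create_exchange"],
}

def reformulate(query: str) -> str:
    """Return the query augmented with relevant rules and tool hints."""
    q = query.lower()
    rules = [rule for key, rule in DOMAIN_RULES.items() if key in q]
    tools = sorted({t for key, ts in TOOL_SUGGESTIONS.items()
                    if key in q for t in ts})
    parts = [f"User query: {query}"]
    if rules:
        parts.append("Relevant domain rules:\n"
                     + "\n".join(f"- {r}" for r in rules))
    if tools:
        parts.append("Suggested tools: " + ", ".join(tools))
    return "\n\n".join(parts)

if __name__ == "__main__":
    print(reformulate("I want a refund for my order from last week."))
```

The reformulated string, rather than the bare user turn, is what gets passed to the tool-calling agent, which is the core of the approach: decision-making improves because the policy constraints and candidate tools are restated in the agent's immediate context at every turn.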