Reinforced Agent: Inference-Time Feedback for Tool-Calling Agents

📅 2026-04-29

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses a limitation of existing tool-calling evaluation methods: they rely predominantly on post-hoc analysis and therefore cannot correct errors in real time during execution. The authors propose a dual-agent architecture in which an independent reviewer agent evaluates each provisional tool call before it is executed, shifting from post-hoc recovery to proactive intervention. Reviewer design is guided by Helpfulness-Harmfulness metrics, which measure the share of base-agent errors that feedback corrects against the share of correct responses that feedback degrades. Combined with GEPA-based automatic prompt optimization, the inference-time feedback mechanism improves performance without retraining the base agent. Reported gains are +5.5% on BFCL irrelevance detection (single-turn) and +7.1% on Tau2-Bench multi-turn tasks. As the reviewer model, o3-mini achieves a 3:1 benefit-to-risk ratio versus 2.1:1 for GPT-4o, and GEPA optimization adds a further 1.5–2.8%.
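The review-before-execute loop described in this summary can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the names `ToolCall`, `Review`, `run_with_review`, and the `max_revisions` cap are all hypothetical, and the toy proposer and reviewer stand in for real LLM calls.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ToolCall:
    """A provisional tool call proposed by the base agent (hypothetical type)."""
    name: str
    args: dict


@dataclass
class Review:
    """The reviewer's verdict on a provisional call (hypothetical type)."""
    approved: bool
    feedback: str = ""


def run_with_review(
    propose: Callable[[Optional[str]], ToolCall],
    review: Callable[[ToolCall], Review],
    execute: Callable[[ToolCall], str],
    max_revisions: int = 2,  # assumption: the source does not state a cap
) -> str:
    """Review each provisional call before execution; feed objections back."""
    call = propose(None)  # first proposal, no feedback yet
    for _ in range(max_revisions):
        verdict = review(call)
        if verdict.approved:
            break
        # The reviewer objected: ask the base agent to revise using the feedback.
        call = propose(verdict.feedback)
    # Once the revision budget is spent, the latest proposal is executed
    # even if the reviewer never approved it.
    return execute(call)


# Toy stand-ins for the two agents and the tool runtime.
def toy_propose(feedback: Optional[str]) -> ToolCall:
    if feedback is None:
        return ToolCall("delete_file", {"path": "report.txt"})
    return ToolCall("read_file", {"path": "report.txt"})


def toy_review(call: ToolCall) -> Review:
    if call.name == "delete_file":
        return Review(False, "The user asked to read the file, not delete it.")
    return Review(True)


def toy_execute(call: ToolCall) -> str:
    return f"executed {call.name}"


result = run_with_review(toy_propose, toy_review, toy_execute)
```

In this toy run the reviewer rejects the first proposal and the revised call is the one that gets executed. Because the reviewer is an independent component, it can be swapped or re-prompted without touching the base agent.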

📝 Abstract

Tool-calling agents are evaluated on tool selection, parameter accuracy, and scope recognition, yet LLM trajectory assessments remain inherently post-hoc. Disconnected from the active execution loop, such assessments identify errors that are usually addressed through prompt-tuning or retraining, and fundamentally cannot course-correct the agent in real time. To close this gap, we move evaluation into the execution loop at inference time: a specialized reviewer agent evaluates provisional tool calls prior to execution, shifting the paradigm from post-hoc recovery to proactive evaluation and error mitigation. In practice, this architecture establishes a clear separation of concerns between the primary execution agent and a secondary review agent. As with any multi-agent system, the reviewer can introduce new errors while correcting others, yet no prior work to our knowledge has systematically measured this tradeoff. To quantify this tradeoff, we introduce Helpfulness-Harmfulness metrics: helpfulness measures the percentage of base agent errors that feedback corrects; harmfulness measures the percentage of correct responses that feedback degrades. These metrics directly inform reviewer design by revealing whether a given model or prompt provides net positive value. We evaluate our approach on BFCL (single-turn) and Tau2-Bench (multi-turn stateful scenarios), achieving +5.5% on irrelevance detection and +7.1% on multi-turn tasks. Our metrics reveal that reviewer model choice is critical: the reasoning model o3-mini achieves a 3:1 benefit-to-risk ratio versus 2.1:1 for GPT-4o. Automated prompt optimization via GEPA provides an additional +1.5-2.8%. Together, these results demonstrate a core advantage of separating execution and review: the reviewer can be systematically improved through model selection and prompt optimization, without retraining the base agent.
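The two metrics defined in the abstract can be computed from paired per-example outcomes. The sketch below assumes each example has a boolean correctness label for the base agent alone and for the agent with reviewer feedback; the function name `helpfulness_harmfulness` and the sample data are illustrative and not taken from the paper.

```python
from typing import Sequence, Tuple


def helpfulness_harmfulness(
    base_correct: Sequence[bool],
    reviewed_correct: Sequence[bool],
) -> Tuple[float, float]:
    """Return (helpfulness %, harmfulness %) from paired outcomes.

    helpfulness: share of base-agent errors that feedback corrects.
    harmfulness: share of base-agent correct responses that feedback degrades.
    """
    if len(base_correct) != len(reviewed_correct):
        raise ValueError("outcome lists must have the same length")

    pairs = list(zip(base_correct, reviewed_correct))
    base_errors = sum(1 for b, _ in pairs if not b)
    base_right = len(pairs) - base_errors
    fixed = sum(1 for b, r in pairs if not b and r)    # wrong -> right
    broken = sum(1 for b, r in pairs if b and not r)   # right -> wrong

    # Assumption: report 0.0 when a denominator is empty.
    helpfulness = 100 * fixed / base_errors if base_errors else 0.0
    harmfulness = 100 * broken / base_right if base_right else 0.0
    return helpfulness, harmfulness


# Made-up outcomes for six examples (not data from the paper).
base = [True, True, True, True, False, False]
reviewed = [True, True, True, False, True, False]
scores = helpfulness_harmfulness(base, reviewed)
```

Here the base agent makes two errors and feedback fixes one of them (helpfulness 50%), while one of the four originally correct responses is degraded (harmfulness 25%). Comparing the two numbers for a given reviewer model or prompt indicates whether the feedback provides net positive value.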

Problem

Research questions and friction points this paper is trying to address.

tool-calling agents

inference-time feedback

real-time error correction

post-hoc evaluation

execution loop

Innovation

Methods, ideas, or system contributions that make the work stand out.

inference-time feedback

tool-calling agents

reviewer agent