RubricRefine: Improving Tool-Use Agent Reliability with Training-Free Pre-Execution Refinement

📅 2026-05-10

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses a frequent failure mode of code-mode tool-calling agents: inter-tool contract violations, such as wrong output shape, incorrect tool routing, or broken argument provenance. These violations run to completion without raising runtime errors, so execution-based self-correction has little signal to work with. The authors propose RubricRefine, a training-free, pre-execution reliability layer built on rubric-based contract validation. It generates rubrics specific to the task and the tool registry, scores candidate code against explicit contract checks, and iteratively repairs violations, all without executing the code. On the M3ToolEval benchmark, the method reaches an average accuracy of 0.86 across seven models and improves over prior inference-time baselines on every model tested, at roughly 38% of the latency (2.6× lower) of the strongest non-iterative alternative.

📝 Abstract

Iterative self-refinement is a popular inference-time reliability technique, but its effectiveness in code-mode tool use depends heavily on the structure of the feedback signal: unstructured critique helps inconsistently across models, and even revision with real execution feedback improves only modestly ($0.75$ vs. a $0.65$ baseline). The dominant failures are inter-tool contract violations (wrong output shape, incorrect tool routing, broken argument provenance) that run to completion without raising errors, making runtime feedback insufficient. We introduce RubricRefine, a training-free pre-execution reliability layer that generates task- and registry-specific rubrics, scores candidate code against explicit contract checks, and iteratively repairs failures before any execution occurs. With zero execution attempts, RubricRefine reaches $0.86$ on M3ToolEval averaged across seven models, improving over prior inference-time baselines on every model tested on this benchmark at $2.6\times$ lower latency than the strongest non-iterative alternative. It remains flat on the predominantly single-step API-Bank, consistent with the method's reliance on inter-tool contract structure. A rubric-category ablation and calibration analysis further characterize when and why the method works.
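
To make the score-then-repair loop in the abstract concrete, here is a minimal sketch of a pre-execution refinement loop. It is not the authors' implementation. The abstract does not say how rubrics are generated or scored, so this sketch swaps in deterministic checks over Python's `ast` module, one per failure category named above (routing, arguments, output shape). The registry contents, the `result` output convention, and every function name (`build_rubric`, `score`, `refine`, `toy_repair`) are assumptions made for illustration, and `toy_repair` is a canned stand-in for whatever model call would perform the repair.

```python
import ast
from dataclasses import dataclass
from typing import Callable

# Hypothetical tool registry: tool name -> required keyword arguments.
REGISTRY = {
    "search_flights": {"origin", "destination"},
    "convert_currency": {"amount", "to"},
}


@dataclass
class RubricItem:
    name: str
    check: Callable[[ast.AST], bool]  # static check; never executes the code


def tool_calls(tree):
    """Return every call to a bare name found in the parsed candidate."""
    return [n for n in ast.walk(tree)
            if isinstance(n, ast.Call) and isinstance(n.func, ast.Name)]


def build_rubric(required_tools, registry):
    """Derive contract checks from the task's required tools and the registry."""
    def routing(tree):
        # Every tool the task needs must actually be called.
        called = {c.func.id for c in tool_calls(tree)}
        return set(required_tools) <= called

    def arguments(tree):
        # Each registry tool call must pass all required keyword arguments.
        return all(registry[c.func.id] <= {k.arg for k in c.keywords}
                   for c in tool_calls(tree) if c.func.id in registry)

    def output_shape(tree):
        # Assumed contract: the final answer is bound to a variable `result`.
        return any(isinstance(n, ast.Assign) and
                   any(isinstance(t, ast.Name) and t.id == "result"
                       for t in n.targets)
                   for n in ast.walk(tree))

    return [RubricItem("routing", routing),
            RubricItem("arguments", arguments),
            RubricItem("output_shape", output_shape)]


def score(code, rubric):
    """Return the names of failed rubric items (empty list means all pass)."""
    try:
        tree = ast.parse(code)
    except SyntaxError:
        return [item.name for item in rubric]
    return [item.name for item in rubric if not item.check(tree)]


def refine(code, rubric, repair, max_rounds=3):
    """Score, repair, and rescore until the rubric passes or rounds run out."""
    for _ in range(max_rounds):
        failed = score(code, rubric)
        if not failed:
            break
        code = repair(code, failed)
    return code, score(code, rubric)


def toy_repair(code, failed):
    # Stand-in for a model call: applies canned fixes keyed by failed item.
    if "arguments" in failed:
        code = code.replace("convert_currency(amount=price)",
                            "convert_currency(amount=price, to='EUR')")
    if "output_shape" in failed:
        code = code.replace("answer =", "result =")
    return code


# This candidate would run without raising in a permissive sandbox,
# yet it violates two of the three contracts.
candidate = (
    "price = search_flights(origin='SFO', destination='CDG')\n"
    "answer = convert_currency(amount=price)\n"
)
rubric = build_rubric({"search_flights", "convert_currency"}, REGISTRY)
initial_failures = score(candidate, rubric)
fixed, remaining = refine(candidate, rubric, toy_repair)
```

The candidate is only parsed, never run, so the loop makes zero execution attempts. The failed item names are passed to the repair step as structured feedback, in contrast to the unstructured critique that the abstract reports as helping inconsistently.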

Problem

Research questions and friction points this paper is trying to address.

tool-use reliability

contract violations

execution feedback

code-mode tool use

inference-time refinement

Innovation

Methods, ideas, or system contributions that make the work stand out.

RubricRefine

tool-use reliability

pre-execution refinement