When Simulation Lies: A Sim-to-Real Benchmark and Domain-Randomized RL Recipe for Tool-Use Agents

📅 2026-05-12

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the performance gap between benchmarked and deployed tool-use agents, which arises from noisy inputs, unreliable APIs, and ambiguous tool registries. The authors formalize tool use as a partially observable Markov decision process (POMDP) and analyze how perturbations to the observation, action space, reward-relevant metadata, and transition dynamics affect agent robustness. They introduce RobustBench-TC, a benchmark of 22 perturbation types, each grounded in a verified GitHub issue or documented tool-calling failure, and propose ToolRL-DR, a reinforcement learning recipe based on domain randomization. On a 3B-parameter backbone, ToolRL-DR-Full retains roughly three-quarters of clean accuracy, reaches aggregate perturbed accuracy comparable to open-source 14B function-calling baselines, and closes approximately 27% of the Transition gap even though it never sees transition perturbations during training.

📝 Abstract

Tool-use language agents are evaluated on benchmarks that assume clean inputs, unambiguous tool registries, and reliable APIs. Real deployments violate all these assumptions: user typos propagate into hallucinated tool names, a misconfigured request timeout can stall an agent indefinitely, and duplicate tool names across servers can freeze an SDK. We study these failures as a sim-to-real gap in the tool-use partially observable Markov decision process (POMDP), where deployment noise enters through the observation, action space, reward-relevant metadata, or transition dynamics. We introduce RobustBench-TC, a benchmark with 22 perturbation types organized by these four POMDP components, each grounded in a verified GitHub issue or documented tool-calling failure. Across 21 models from 1.5B to 32B parameters (including the closed-source o4-mini), the robustness profile is sharply uneven: observation perturbations reduce accuracy by less than 5%, while reward-relevant and transition perturbations reduce accuracy by roughly 40% and 30%, respectively; scale alone does not close these gaps. We then propose ToolRL-DR, a domain-randomization reinforcement learning (RL) recipe that trains a tool-use agent on perturbation-augmented trajectories spanning the three statically encodable POMDP components. On a 3B backbone, ToolRL-DR-Full retains roughly three-quarters of clean accuracy and reaches an aggregate perturbed accuracy comparable to open-source 14B function-calling baselines while substantially narrowing the gap to o4-mini. It closes approximately 27% of the Transition gap despite never seeing transition perturbations in training, suggesting that RL on adversarial static tool-use inputs induces a more persistent retry policy that transfers to unseen runtime failures. The dataset, code and benchmark leaderboard are publicly available.
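
To make the domain-randomization idea concrete, the following is a minimal sketch of how clean tool-use episodes might be perturbed along the three statically encodable POMDP components (observation, action space, reward-relevant metadata). It is not the authors' implementation: the `Episode` structure, the three perturbation functions, and the per-component probability `p` are all hypothetical placeholders chosen for illustration, and the actual 22 perturbation types in RobustBench-TC are not reproduced here.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Episode:
    """Hypothetical container for one clean tool-use training example."""
    query: str                                     # observation: user input
    tools: list[dict]                              # action space: tool registry
    metadata: dict = field(default_factory=dict)   # reward-relevant metadata


def typo_query(ep: Episode, rng: random.Random) -> Episode:
    """Observation perturbation (assumed): swap two adjacent characters."""
    chars = list(ep.query)
    if len(chars) >= 2:
        i = rng.randrange(len(chars) - 1)
        chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return Episode("".join(chars), ep.tools, ep.metadata)


def duplicate_tool(ep: Episode, rng: random.Random) -> Episode:
    """Action-space perturbation (assumed): same tool name on a second server."""
    if not ep.tools:
        return ep
    clone = dict(rng.choice(ep.tools), server="mirror")
    return Episode(ep.query, ep.tools + [clone], ep.metadata)


def drop_metadata_key(ep: Episode, rng: random.Random) -> Episode:
    """Metadata perturbation (assumed): remove one key from the metadata."""
    if not ep.metadata:
        return ep
    victim = rng.choice(sorted(ep.metadata))
    kept = {k: v for k, v in ep.metadata.items() if k != victim}
    return Episode(ep.query, ep.tools, kept)


# One illustrative perturbation per statically encodable component.
PERTURBATIONS = {
    "observation": [typo_query],
    "action_space": [duplicate_tool],
    "reward_metadata": [drop_metadata_key],
}


def randomize(ep: Episode, rng: random.Random, p: float = 0.5):
    """Apply, with probability p per component, one perturbation drawn from it.

    Returns the (possibly) perturbed episode and the names of what was applied.
    The input episode is never mutated.
    """
    applied = []
    for component, candidates in PERTURBATIONS.items():
        if rng.random() < p:
            perturb = rng.choice(candidates)
            ep = perturb(ep, rng)
            applied.append(f"{component}:{perturb.__name__}")
    return ep, applied


if __name__ == "__main__":
    rng = random.Random(0)
    clean = Episode(
        query="weather in Paris",
        tools=[{"name": "get_weather", "server": "primary"}],
        metadata={"units": "metric"},
    )
    noisy, applied = randomize(clean, rng)
    print(applied)
    print(noisy)
```

Transition perturbations (timeouts, failing APIs) are deliberately absent from this sketch, mirroring the abstract's statement that they occur at runtime and are not part of the statically encoded training data.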

Problem

Research questions and friction points this paper is trying to address.

sim-to-real gap

tool-use agents

POMDP

deployment noise

robustness

Innovation

Methods, ideas, or system contributions that make the work stand out.

sim-to-real gap

domain randomization

tool-use agents