🤖 AI Summary
Existing synthetic datasets are primarily designed for offline supervised fine-tuning and struggle to support executable, reward-verifiable online reinforcement learning. This work proposes COVERT, a two-stage framework that first generates high-quality base trajectories through self-evolving synthesis and multi-level validation. It then constructs a reinforcement learning environment with automatically computable rewards and verifiable agent behaviors by introducing complex perturbations (distracting tools, ambiguous queries, and noisy tool outputs) while preserving the oracle tool calls and final answers as ground truth. On Qwen2.5-Instruct-14B, COVERT improves BFCL v3 and ACEBench accuracy from 56.5/53.0 to 59.9/59.3, respectively; when combined with supervised fine-tuning, it further reaches 62.1/61.8, without significant degradation in general capabilities.
📝 Abstract
Existing synthetic tool-use corpora are primarily designed for offline supervised fine-tuning, yet reinforcement learning (RL) requires executable environments that support reward-checkable online rollouts. We propose COVERT, a two-stage pipeline that first generates reliable base tool-use trajectories through self-evolving synthesis with multi-level validation, and then applies oracle-preserving augmentations that systematically increase environmental complexity. These augmentations introduce distractor tools, indirect or ambiguous user queries, and noisy, multi-format, or erroneous tool outputs, while strictly preserving oracle tool calls and final answers as ground truth. This design enables automatic reward computation via reference matching for standard cases and lightweight judge-assisted verification for special behaviors such as error detection, supporting RL optimization of tool-calling policies. On Qwen2.5-Instruct-14B, COVERT-RL improves overall accuracy on BFCL v3 from 56.5 to 59.9 and on ACEBench from 53.0 to 59.3, with minimal regressions on general-ability benchmarks; when stacked on SFT, it further reaches 62.1 and 61.8, confirming additive gains. These results suggest that oracle-preserving synthetic environments offer a practical RL refinement stage, complementary to SFT, for improving tool-use robustness under ambiguity and unreliable tool feedback.
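The "automatic reward computation via reference matching" described in the abstract could be implemented roughly as follows: because augmentations preserve the oracle tool calls and final answer, a binary reward can be computed by comparing the policy's predicted calls against the oracle trajectory. This is an illustrative sketch only; the function name, field names (`name`, `arguments`), and the exact-match criterion are assumptions, not details taken from the paper.

```python
import json

def reference_match_reward(predicted_calls, oracle_calls,
                           predicted_answer, oracle_answer):
    """Binary reward for a rollout: 1.0 iff every predicted tool call
    matches its oracle counterpart (tool name + arguments) and the
    final answer matches the preserved oracle answer.

    Each call is a dict like {"name": str, "arguments": dict}.
    """
    if len(predicted_calls) != len(oracle_calls):
        return 0.0
    for pred, gold in zip(predicted_calls, oracle_calls):
        if pred["name"] != gold["name"]:
            return 0.0
        # Canonicalize arguments so key order and whitespace don't matter.
        pred_args = json.dumps(pred["arguments"], sort_keys=True)
        gold_args = json.dumps(gold["arguments"], sort_keys=True)
        if pred_args != gold_args:
            return 0.0
    return 1.0 if predicted_answer.strip() == oracle_answer.strip() else 0.0
```

For the "special behaviors" mentioned in the abstract (e.g., detecting an erroneous tool output), exact matching would not apply; there a lightweight LLM judge would verify the behavior instead, per the paper's description.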