🤖 AI Summary
This work addresses the challenge that multi-turn interactive tool-using agents face in producing correct, deterministic action sequences in response to complex and ambiguous user requests. To this end, the authors propose the CoVe framework, which uses explicit task constraints both to guide trajectory generation and to verify trajectory quality. This approach enables the synthesis of high-quality training trajectories for supervised fine-tuning (SFT) and provides accurate reward signals for reinforcement learning (RL). The CoVe-4B model trained under this framework achieves success rates of 43.0% and 59.4% on the Airline and Retail domains of τ²-bench, respectively. Overall, it significantly outperforms strong baselines of comparable scale and remains competitive with models up to 17 times its size.
📝 Abstract
Developing multi-turn interactive tool-use agents is challenging because real-world user needs are often complex and ambiguous, yet agents must execute deterministic actions to satisfy them. To address this gap, we introduce CoVe (Constraint-Verification), a post-training data synthesis framework designed for training interactive tool-use agents while ensuring both data complexity and correctness. CoVe begins by defining explicit task constraints, which serve a dual role: they guide the generation of complex trajectories and act as deterministic verifiers for assessing trajectory quality. This enables the creation of high-quality training trajectories for supervised fine-tuning (SFT) and the derivation of accurate reward signals for reinforcement learning (RL). Our evaluation on the challenging τ²-bench benchmark demonstrates the effectiveness of the framework. Notably, our compact CoVe-4B model achieves success rates of 43.0% and 59.4% in the Airline and Retail domains, respectively; overall, it significantly outperforms strong baselines of similar scale and remains competitive with models up to 17× its size. These results indicate that CoVe provides an effective and efficient pathway for synthesizing training data for state-of-the-art interactive tool-use agents. To support future research, we open-source our code, trained model, and the full set of 12K high-quality trajectories used for training.
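The dual role of constraints described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: every name here (`Constraint`, `must_call`, `must_not_call`, `verify`, `reward`, `task_prompt`), the tool names, and the binary reward are assumptions made purely for illustration. The idea shown is that one constraint object carries both a description that could steer generation and a deterministic check that could score a finished trajectory.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

# All names below are hypothetical and chosen for illustration only.
ToolCall = Dict[str, Any]      # e.g. {"name": "cancel_order", "args": {"order_id": "A1"}}
Trajectory = List[ToolCall]    # ordered tool calls made by the agent


@dataclass(frozen=True)
class Constraint:
    description: str                      # text that can be placed in a generation prompt
    check: Callable[[Trajectory], bool]   # deterministic test on a finished trajectory


def must_call(name: str, **args: Any) -> Constraint:
    """Require at least one call to `name` whose arguments include `args`."""
    def check(traj: Trajectory) -> bool:
        return any(
            call["name"] == name
            and all(call.get("args", {}).get(k) == v for k, v in args.items())
            for call in traj
        )
    return Constraint(f"call {name} with arguments {args}", check)


def must_not_call(name: str) -> Constraint:
    """Forbid any call to `name`."""
    def check(traj: Trajectory) -> bool:
        return all(call["name"] != name for call in traj)
    return Constraint(f"never call {name}", check)


def task_prompt(constraints: List[Constraint]) -> str:
    """Role 1 (guidance): turn constraints into text for a trajectory generator."""
    return "Task requirements: " + "; ".join(c.description for c in constraints)


def verify(traj: Trajectory, constraints: List[Constraint]) -> bool:
    """Role 2 (verification): a trajectory passes only if every check holds."""
    return all(c.check(traj) for c in constraints)


def reward(traj: Trajectory, constraints: List[Constraint]) -> float:
    """Assumed binary reward: 1.0 if all constraints hold, else 0.0."""
    return 1.0 if verify(traj, constraints) else 0.0


# Made-up retail-style example.
constraints = [must_call("cancel_order", order_id="A1"), must_not_call("issue_refund")]
good = [
    {"name": "get_order", "args": {"order_id": "A1"}},
    {"name": "cancel_order", "args": {"order_id": "A1"}},
]
bad = good + [{"name": "issue_refund", "args": {"order_id": "A1"}}]
```

In this sketch, `verify(good, constraints)` could act as a filter when selecting SFT trajectories, while `reward(...)` could supply a scalar signal during RL. How CoVe actually defines, checks, and scores constraints is specified in the paper and released code, not here.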