🤖 AI Summary
This work addresses the poor performance of existing tool-calling agents when confronted with user intents that are ambiguous, evolve over time, or are infeasible under policy constraints, largely because training and evaluation data covering such complex scenarios are scarce. To bridge this gap, the authors propose Trajectory2Task, a framework that systematically constructs a verifiable synthetic data generation pipeline. Through multi-turn trajectory exploration, controllable intent transformation, and task formulation, the framework produces complex tasks spanning these three realistic challenges. Experiments show that leading large language models generally underperform on these tasks, whereas lightweight models fine-tuned on successful trajectories achieve substantially better performance and stronger cross-domain generalization.
📝 Abstract
Tool-calling agents are increasingly deployed in real-world customer-facing workflows, yet most studies of tool-calling agents focus on idealized settings with general, fixed, and well-specified tasks. In real-world applications, user requests are often (1) ambiguous, (2) changing over time, or (3) infeasible due to policy constraints, and training and evaluation data covering these diverse, complex interaction patterns remain under-represented. To bridge this gap, we present Trajectory2Task, a verifiable data generation pipeline for studying tool use at scale under three realistic user scenarios: ambiguous, changing, and infeasible intents. The pipeline first conducts multi-turn exploration to produce valid tool-call trajectories, then converts these trajectories into user-facing tasks with controlled intent adaptations. This process yields verifiable tasks that support closed-loop evaluation and training. We benchmark seven state-of-the-art LLMs on the generated tasks and observe frequent failures. Finally, using successful trajectories obtained from task rollouts, we fine-tune lightweight LLMs and find consistent improvements across all three conditions, along with better generalization to unseen tool-use domains, indicating stronger general tool-calling ability.
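To make the three-stage pipeline concrete, here is a minimal sketch of its shape: explore to get a valid tool-call trajectory, apply a controlled intent adaptation, then formulate a task whose ground-truth trajectory makes it automatically verifiable. All function and field names (`explore_trajectory`, `transform_intent`, `formulate_task`, `gold_tool_calls`) are illustrative assumptions, not the paper's actual API.

```python
import random

def explore_trajectory(tools, steps=3):
    """Stage 1 (sketch): multi-turn exploration yielding a valid tool-call trajectory."""
    return [{"turn": t, "tool": random.choice(tools)} for t in range(steps)]

def transform_intent(trajectory, scenario):
    """Stage 2 (sketch): apply a controlled intent adaptation to the trajectory."""
    assert scenario in {"ambiguous", "changing", "infeasible"}
    return {"trajectory": trajectory, "scenario": scenario}

def formulate_task(adapted):
    """Stage 3 (sketch): convert the adapted trajectory into a verifiable task.

    The original trajectory serves as ground truth, so any rollout can be
    checked in a closed loop by comparing its tool calls against the gold ones.
    """
    gold = [step["tool"] for step in adapted["trajectory"]]
    return {
        "scenario": adapted["scenario"],
        "gold_tool_calls": gold,
        "verify": lambda calls: calls == gold,  # closed-loop check
    }

task = formulate_task(
    transform_intent(explore_trajectory(["search", "book", "cancel"]), "ambiguous")
)
print(task["scenario"], task["verify"](task["gold_tool_calls"]))
```

The key design point this sketch captures is that verifiability falls out of the construction order: because tasks are derived *from* known-valid trajectories rather than written first, every generated task carries its own answer key.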