🤖 AI Summary
To address the scarcity of real customer queries and the unreliability of tool-use trajectories in industrial LLM agent training, this paper proposes a two-stage multi-agent framework: "Generation–Inverse Validation." First, agents collaboratively generate high-fidelity customer queries; second, they perform inverse reasoning over model responses to verify the correctness of tool-call trajectories. The paper further introduces a lightweight, distant-supervision-based trajectory discriminator, replacing costly large-model judges (e.g., GPT-4) with efficient, interpretable traditional models such as XGBoost. Experiments demonstrate that the synthesized data significantly improves agent generalization on real-world queries. The trajectory validation achieves 11% higher accuracy than a GPT-4o baseline, matching GPT-4-level performance while offering superior computational efficiency and model interpretability.
📝 Abstract
Extending the capabilities of Large Language Models (LLMs) with functions or tools for environment interaction has led to the emergence of the agent paradigm. In industry, training an LLM is not always feasible because of the scarcity of domain data, legal holds on proprietary customer data, rapidly changing business requirements, and the need to prototype new assistants. Agents provide an elegant solution to these constraints by relying on the zero-shot reasoning abilities of the underlying LLM and utilizing tools to explore and reason over customer data and respond to user requests. However, there are two concerns here: (I) acquiring large-scale customer queries for agent testing is time-consuming, and (II) high reliance on the tool-call sequence (or trajectory) followed by the agent to respond to user queries may lead to unexpected or incorrect behavior. To address these concerns, we propose MAG-V, a multi-agent framework to first generate a dataset of questions that mimic customer queries, and second, reverse-engineer alternate questions from the responses for trajectory verification. Initial results indicate that our synthetic data can improve agent performance on actual customer queries. Furthermore, our trajectory verification methodology, inspired by distant supervision and using traditional machine learning (ML) models, outperforms a GPT-4o judge baseline by 11% in accuracy and matches the performance of a GPT-4 judge on our constructed dataset. Overall, our approach is a step towards unifying diverse task agents into a cohesive framework for achieving an aligned objective.
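The distant-supervision idea above can be sketched in a few lines: trajectories are weakly labeled by whether the final answer matched a reference (no manual annotation), featurized with simple statistics, and scored by a traditional ML classifier instead of an LLM judge. Everything below is an illustrative assumption, not the paper's actual pipeline: the feature set, the toy data, and the use of scikit-learn's `GradientBoostingClassifier` as a stand-in for XGBoost are all hypothetical.

```python
# Hedged sketch of a distant-supervision trajectory discriminator.
# Features, toy data, and the sklearn stand-in for XGBoost are assumptions.
from sklearn.ensemble import GradientBoostingClassifier

def featurize(trajectory, expected_tools):
    """Map a tool-call trajectory to simple numeric features (illustrative)."""
    calls = [step["tool"] for step in trajectory]
    return [
        len(calls),                               # trajectory length
        len(set(calls)),                          # distinct tools used
        sum(t in expected_tools for t in calls),  # overlap with expected tools
        int(calls == expected_tools),             # exact sequence match
    ]

# Distant supervision: a trajectory whose final answer matched the reference
# answer is weakly labeled 1 (correct), otherwise 0 -- no human annotation.
trajectories = [
    ([{"tool": "search"}, {"tool": "lookup"}], ["search", "lookup"], 1),
    ([{"tool": "search"}], ["search", "lookup"], 0),
    ([{"tool": "lookup"}, {"tool": "search"}], ["search", "lookup"], 0),
    ([{"tool": "search"}, {"tool": "lookup"}, {"tool": "lookup"}],
     ["search", "lookup"], 1),
] * 5  # repeat the toy rows so the model has enough samples to fit

X = [featurize(traj, expected) for traj, expected, _ in trajectories]
y = [label for _, _, label in trajectories]

clf = GradientBoostingClassifier(n_estimators=20, random_state=0).fit(X, y)

# Score a new trajectory with the cheap discriminator instead of an LLM judge.
pred = clf.predict([featurize([{"tool": "search"}, {"tool": "lookup"}],
                              ["search", "lookup"])])[0]
```

A classifier like this is the source of the claimed efficiency and interpretability gains: inference is microseconds per trajectory, and tree-based models expose feature importances that an LLM judge cannot.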