🤖 AI Summary
Existing methods struggle to efficiently generate test queries that are realistic, diverse, and able to trigger failures, such as unhelpful or unsafe responses, in question-answering agents built on large language models. To address this, the work proposes PQR, a framework that incorporates real user intent into automated failure discovery. PQR iterates between a query refinement module and a prompt refinement module: the former rewrites queries to explore variations, while the latter uses prior feedback to derive objective-violating strategies and realism policies that guide subsequent query generation. Evaluated on an e-commerce question-answering agent, PQR uncovers 23%–78% more unhelpful responses than previous methods and produces queries that are more diverse and realistic.
📝 Abstract
Evaluating LLM-based agents remains challenging because identifying meaningful failure cases often requires substantial human effort to design realistic test scenarios. Prior work primarily focuses on automatically discovering agent failures induced by adversarial users, while overlooking queries that reflect real user intents yet also trigger agent failures. We introduce PQR, a framework that generates queries which not only surface agent failures with respect to specific objectives (e.g., helpfulness or safety) but also resemble real users' intents. PQR operates through an iterative interaction between two complementary modules. The query refinement module rewrites queries to explore diverse variations, while the prompt refinement module uses prior feedback to derive new objective-violating strategies and realism policies for refining prompts, which in turn generate failure-triggering yet realistic queries. We evaluate PQR on detecting unhelpful responses from an e-commerce QA agent. Our method uncovers 23%–78% more unhelpful responses than previous methods, and its generated queries are more diverse and realistic.
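The iterative interaction between the two modules can be pictured with a minimal Python sketch. This is an illustration under stated assumptions, not the authors' implementation: every name here (`PromptState`, `run_loop`, and the `toy_*` helpers) is hypothetical, and the agent, failure judge, and realism check are deterministic stand-ins for what would in practice be LLM calls. The abstract does not specify how rewrites are produced or how feedback is scored, so those parts are simplified placeholders.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

Feedback = Tuple[str, bool, bool]  # (query, triggered_failure, is_realistic)


@dataclass
class PromptState:
    """Hypothetical container for what the prompt refinement module maintains."""
    violation_strategies: List[str] = field(default_factory=list)
    realism_policies: List[str] = field(default_factory=list)


def run_loop(seeds: List[str],
             rewrite: Callable[[str, PromptState], str],
             agent: Callable[[str], str],
             is_failure: Callable[[str, str], bool],
             is_realistic: Callable[[str], bool],
             refine_prompt: Callable[[PromptState, List[Feedback]], PromptState],
             rounds: int = 3) -> Tuple[List[str], PromptState]:
    """Alternate query refinement and prompt refinement for a fixed number of rounds."""
    prompt = PromptState()
    found: List[str] = []
    for _ in range(rounds):
        feedback: List[Feedback] = []
        # Query refinement: rewrite each seed under the current prompt state.
        for query in (rewrite(seed, prompt) for seed in seeds):
            failed = is_failure(query, agent(query))
            realistic = is_realistic(query)
            feedback.append((query, failed, realistic))
            # Keep only queries that both trigger a failure and look realistic.
            if failed and realistic and query not in found:
                found.append(query)
        # Prompt refinement: update strategies and policies from this round's feedback.
        prompt = refine_prompt(prompt, feedback)
    return found, prompt


# --- Toy stand-ins (assumptions for illustration only) ---
def toy_rewrite(query: str, prompt: PromptState) -> str:
    # Append the most recent strategy text to the seed, if one exists.
    if prompt.violation_strategies:
        return f"{query} {prompt.violation_strategies[-1]}"
    return query


def toy_agent(query: str) -> str:
    # Pretend the agent cannot handle compatibility questions.
    return "I don't know." if "compatib" in query else "Here is the answer."


def toy_is_failure(query: str, response: str) -> bool:
    return response == "I don't know."


def toy_is_realistic(query: str) -> bool:
    # Crude proxy: short queries count as realistic.
    return len(query.split()) <= 12


def toy_refine(prompt: PromptState, feedback: List[Feedback]) -> PromptState:
    strategies = list(prompt.violation_strategies)
    policies = list(prompt.realism_policies)
    if not any(failed for _, failed, _ in feedback):
        strategies.append("Is it compatible with my older model?")
    if any(not realistic for _, _, realistic in feedback):
        policies.append("Keep queries to 12 words or fewer.")
    return PromptState(strategies, policies)


if __name__ == "__main__":
    queries, state = run_loop(["Does this charger work?"], toy_rewrite, toy_agent,
                              toy_is_failure, toy_is_realistic, toy_refine)
    print(queries, state)
```

In this toy setup, the first round finds no failure, so the prompt state gains a strategy that later rewrites use. The filter that keeps only failing and realistic queries mirrors the abstract's "failure-triggering yet realistic" criterion in the simplest possible form.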