PQR: A Framework to Generate Diverse and Realistic User Queries that Elicit QA Agent Failures

📅 2026-05-15

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing methods struggle to generate test queries that are at once realistic, diverse, and able to trigger failures, such as unhelpful or unsafe responses, in question-answering agents built on large language models. To address this, the work proposes PQR, a framework that brings real user intent into automated failure discovery. PQR alternates between query refinement, which rewrites queries to explore diverse variations, and prompt refinement, which uses prior feedback to derive objective-violating strategies and realism policies. Evaluated on an e-commerce question-answering agent, PQR uncovers 23%–78% more unhelpful responses than previous methods, and its generated queries are more diverse and realistic.

📝 Abstract

Evaluating LLM-based agents remains challenging because identifying meaningful failure cases often requires substantial human effort to design realistic test scenarios. Prior work primarily focuses on automatically discovering agent failures induced by adversarial users, while overlooking queries with real user intents that also trigger agent failures. We introduce PQR, a framework that generates queries which not only surface agent failures with respect to specific objectives (e.g., helpfulness or safety) but also resemble real users' intents. PQR operates through an iterative interaction between two complementary modules. The query refinement module performs rewrites to explore diverse query variations, while the prompt refinement module uses prior feedback to derive new objective-violating strategies and realism policies for refining prompts, which in turn generate failure-triggering yet realistic queries. We evaluate PQR on detecting an e-commerce QA agent's unhelpful responses. Our method uncovers 23%–78% more unhelpful responses, and our generated queries are more diverse and realistic than those of previous methods.
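
The interaction between the two modules can be pictured with a minimal Python sketch. This is not the authors' implementation: every name below (`GenerationPrompt`, `run_loop`, `toy_agent`, and so on) is hypothetical, and simple string rules stand in for the LLM calls that would rewrite queries, judge helpfulness, and judge realism. It only illustrates the control flow the abstract describes, where feedback from one round updates the strategies and policies used in the next.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

# Hypothetical strategy and policy labels; the paper derives these with an LLM.
COMPARE = "ask for a comparison"
SHORT = "keep the query short"


@dataclass
class GenerationPrompt:
    """State that the prompt refinement step updates between rounds."""
    violation_strategies: List[str] = field(default_factory=list)
    realism_policies: List[str] = field(default_factory=list)


def toy_agent(query: str) -> str:
    """Stand-in QA agent (assumption): it cannot handle comparisons."""
    return "I'm not sure." if "compare" in query.lower() else "Here is the answer."


def is_unhelpful(response: str) -> bool:
    """Stand-in objective judge (assumption)."""
    return response.startswith("I'm not sure")


def is_realistic(query: str) -> bool:
    """Stand-in realism judge (assumption): real users ask short questions."""
    return len(query.split()) <= 15


def toy_rewrite(query: str, prompt: GenerationPrompt) -> str:
    """Query refinement stand-in: apply the current strategies and policies."""
    if COMPARE in prompt.violation_strategies and "compare" not in query.lower():
        query = query.rstrip("?") + ", and how does it compare to similar items?"
    if SHORT in prompt.realism_policies:
        query = " ".join(query.split()[:15])
    return query


Feedback = List[Tuple[str, bool, bool]]  # (query, failed, realistic)


def refine_prompt(prompt: GenerationPrompt, feedback: Feedback) -> GenerationPrompt:
    """Prompt refinement stand-in: derive new strategies/policies from feedback."""
    strategies = list(prompt.violation_strategies)
    policies = list(prompt.realism_policies)
    if not any(failed for _, failed, _ in feedback) and COMPARE not in strategies:
        strategies.append(COMPARE)
    if any(not realistic for _, _, realistic in feedback) and SHORT not in policies:
        policies.append(SHORT)
    return GenerationPrompt(strategies, policies)


def run_loop(seeds: List[str], rewrite: Callable[[str, GenerationPrompt], str],
             rounds: int = 3) -> Tuple[List[str], GenerationPrompt]:
    """Alternate query refinement and prompt refinement for a fixed number of rounds."""
    prompt = GenerationPrompt()
    queries = list(seeds)
    found: List[str] = []
    for _ in range(rounds):
        queries = [rewrite(q, prompt) for q in queries]
        feedback = [(q, is_unhelpful(toy_agent(q)), is_realistic(q)) for q in queries]
        for q, failed, realistic in feedback:
            # Keep only queries that both trigger a failure and look realistic.
            if failed and realistic and q not in found:
                found.append(q)
        prompt = refine_prompt(prompt, feedback)
    return found, prompt


found, prompt = run_loop(
    ["Is this blender dishwasher safe?", "What size is the bag?"], toy_rewrite
)
```

In this toy run the first round finds no failures, so the prompt gains a comparison strategy, and the rewritten queries in the second round expose the stand-in agent's weakness while still passing the realism check.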

Problem

Research questions and friction points this paper is trying to address.

LLM-based agents

failure detection

realistic user queries

evaluation

user intent

Innovation

Methods, ideas, or system contributions that make the work stand out.

query generation

LLM agent evaluation

realistic user intents