AI Summary
This work addresses a critical limitation in existing benchmarks for large language model (LLM) agents, which typically assume explicit and unambiguous policies, thereby failing to evaluate agent decision-making under the policy ambiguity prevalent in real-world settings. To bridge this gap, we introduce DRIP-R, the first evaluation benchmark grounded in authentic policy ambiguity within retail return scenarios. DRIP-R simulates realistic customer personas, supports full-duplex dialogue, and incorporates tool use to construct open-ended conversational contexts without a single correct answer. We further develop a multi-dimensional, multi-rater human evaluation framework that systematically reveals substantial divergence in decisions made by state-of-the-art LLMs under identical ambiguous conditions. Our findings demonstrate that policy ambiguity poses a systematic challenge to LLMs, establishing DRIP-R as a novel benchmark for advancing research on agent robustness and consistency.
Abstract
LLM-based agents are increasingly deployed for routine but consequential tasks in real-world domains, where their behavior is governed by inherently ambiguous domain policies that admit multiple valid interpretations. Despite the prevalence of such ambiguities in practice, existing agent benchmarks largely assume unambiguous, well-specified policies, leaving a critical evaluation gap. We introduce DRIP-R, a benchmark that systematically exploits real-world retail policy ambiguities to construct scenarios in which no single correct resolution exists. DRIP-R comprises a curated set of policy-ambiguous return scenarios paired with realistic customer personas, a full-duplex conversational simulation with tool-calling capabilities, and a multi-judge evaluation framework covering policy adherence, dialogue quality, behavioral alignment, and resolution quality. Our experiments show that frontier models fundamentally disagree on identical policy-ambiguous scenarios, confirming that ambiguity poses a genuine and systematic challenge to LLM decision-making.