DRIP-R: A Benchmark for Decision-Making and Reasoning Under Real-World Policy Ambiguity in the Retail Domain

📅 2026-05-08

🤖 AI Summary
This work addresses a critical limitation of existing benchmarks for large language model (LLM) agents: they typically assume explicit, unambiguous policies and therefore fail to evaluate agent decision-making under the policy ambiguity prevalent in real-world settings. To bridge this gap, we introduce DRIP-R, the first evaluation benchmark grounded in authentic policy ambiguity within retail return scenarios. DRIP-R simulates realistic customer personas, supports full-duplex dialogue, and incorporates tool use to construct open-ended conversational contexts that have no single correct answer. We further develop a multi-dimensional, multi-judge evaluation framework covering policy adherence, dialogue quality, behavioral alignment, and resolution quality. Experiments with it reveal substantial divergence in the decisions that state-of-the-art LLMs make under identical ambiguous conditions. Our findings indicate that policy ambiguity poses a systematic challenge to LLMs, and they position DRIP-R as a benchmark for advancing research on agent robustness and consistency.
πŸ“ Abstract
LLM-based agents are increasingly deployed for routine but consequential tasks in real-world domains, where their behavior is governed by inherently ambiguous domain policies that admit multiple valid interpretations. Despite the prevalence of such ambiguities in practice, existing agent benchmarks largely assume unambiguous, well-specified policies, leaving a critical evaluation gap. We introduce DRIP-R, a benchmark that systematically exploits real-world retail policy ambiguities to construct scenarios in which no single correct resolution exists. DRIP-R comprises a curated set of policy-ambiguous return scenarios paired with a realistic customer personas, a full-duplex conversational simulation with tool-calling capabilities and a multi-judge evaluation framework covering policy adherence, dialogue quality, behavioral alignment, and resolution quality. Our experiments show that frontier models fundamentally disagree on identical policy-ambiguous scenarios, confirming that ambiguity poses a genuine and systematic challenge to LLM decision-making.
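The abstract does not say how disagreement between models is quantified. As a minimal sketch of one plausible way to measure it, the snippet below computes the fraction of scenarios on which each pair of models reaches a different final decision. The function name, the model names, the decision labels (`approve`, `deny`, `escalate`), and the toy data are all hypothetical illustrations, not taken from DRIP-R or its results.

```python
from itertools import combinations


def pairwise_disagreement(decisions):
    """Return the disagreement rate for every pair of models.

    `decisions` maps a model name to its list of final decisions, one per
    scenario, with all lists in the same scenario order. This layout is an
    assumption for illustration, not the benchmark's actual data format.
    """
    models = sorted(decisions)
    n_scenarios = len(decisions[models[0]])
    if n_scenarios == 0:
        raise ValueError("at least one scenario is required")
    if any(len(decisions[m]) != n_scenarios for m in models):
        raise ValueError("every model needs one decision per scenario")

    rates = {}
    for a, b in combinations(models, 2):
        # Count the scenarios where the two models chose different outcomes.
        differing = sum(x != y for x, y in zip(decisions[a], decisions[b]))
        rates[(a, b)] = differing / n_scenarios
    return rates


# Hypothetical decisions on four identical policy-ambiguous scenarios.
toy_decisions = {
    "model_a": ["approve", "deny", "escalate", "approve"],
    "model_b": ["approve", "approve", "escalate", "deny"],
    "model_c": ["deny", "deny", "escalate", "approve"],
}

rates = pairwise_disagreement(toy_decisions)
for pair, rate in rates.items():
    print(pair, rate)
```

A raw disagreement rate like this ignores agreement that would occur by chance; a chance-corrected statistic such as Cohen's or Fleiss' kappa would be a natural refinement under the same assumptions.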
Problem

Research questions and friction points this paper is trying to address.

policy ambiguity
LLM-based agents
decision-making
retail domain
benchmark
Innovation

Methods, ideas, or system contributions that make the work stand out.

policy ambiguity
LLM agent benchmark
retail decision-making
multi-judge evaluation
full-duplex conversation simulation