🤖 AI Summary
This work addresses the financial and trust risks that arise when multiple pricing rules interact in tourism. We introduce PricingLogic, to our knowledge the first benchmark specifically designed to evaluate large language models (LLMs) on complex fare logic in this domain. Built on 42 real-world pricing policies, it comprises 300 natural-language questions spanning two difficulty tiers: customer-type-based pricing and combinatorial discounting. The evaluation uses multi-stage prompting and fine-grained reasoning assessment to test LLMs' rule parsing and numerical reasoning. Experiments reveal systematic deficiencies in current LLMs, with performance degrading most sharply on the higher-order compositional tasks. Beyond benchmarking, the work motivates research on domain adaptation, model interpretability, and safety mechanisms for revenue-critical AI deployments, and it offers both risk awareness and a methodological basis for trustworthy pricing automation.
📝 Abstract
We present PricingLogic, the first benchmark that probes whether Large Language Models (LLMs) can reliably automate tourism-related pricing when multiple, overlapping fare rules apply. Travel agencies are eager to offload this error-prone task onto AI systems; however, deploying LLMs without verified reliability could result in significant financial losses and erode customer trust. PricingLogic comprises 300 natural-language questions based on booking requests derived from 42 real-world pricing policies, spanning two levels of difficulty: (i) basic customer-type pricing and (ii) bundled-tour calculations involving interacting discounts. Evaluations of a range of LLMs reveal a steep performance drop on the harder tier, exposing systematic failures in rule interpretation and arithmetic reasoning. These results highlight that, despite their general capabilities, today's LLMs remain unreliable in revenue-critical applications without further safeguards or domain adaptation. Our code and dataset are available at https://github.com/EIT-NLP/PricingLogic.
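To make the two difficulty tiers concrete, the sketch below computes a ground-truth price for one booking of each kind. It is only an illustration: the fare table, the attraction names, the discount rates, the order in which discounts stack, and the function names `tier1_total` and `tier2_total` are all hypothetical and are not taken from the PricingLogic dataset or its 42 policies.

```python
from decimal import Decimal

# Hypothetical fare table. None of these prices or rules come from PricingLogic.
UNIT_PRICE = {
    ("museum", "adult"): Decimal("40"),
    ("museum", "child"): Decimal("20"),
    ("boat", "adult"): Decimal("60"),
    ("boat", "child"): Decimal("30"),
}


def tier1_total(attraction, party):
    """Tier-1 style question: the price depends only on customer type."""
    return sum(
        (UNIT_PRICE[(attraction, ctype)] * count for ctype, count in party.items()),
        Decimal("0"),
    )


def tier2_total(attractions, party):
    """Tier-2 style question: a bundled tour with two interacting discounts."""
    subtotal = sum((tier1_total(a, party) for a in attractions), Decimal("0"))
    # Assumed rule 1: 10% off when the bundle contains at least two attractions.
    if len(attractions) >= 2:
        subtotal *= Decimal("0.90")
    # Assumed rule 2: a further 5% off for parties of 10 or more,
    # applied to the already discounted subtotal.
    if sum(party.values()) >= 10:
        subtotal *= Decimal("0.95")
    return subtotal.quantize(Decimal("0.01"))


small_party = {"adult": 2, "child": 1}
large_party = {"adult": 8, "child": 2}

print(tier1_total("museum", small_party))            # 100
print(tier2_total(["museum", "boat"], small_party))  # 225.00
print(tier2_total(["museum", "boat"], large_party))  # 769.50
```

In the last call the two discounts compound (900 × 0.90 × 0.95), so a model that applies only one rule, or applies them additively as 15% off, returns a different total. This is the kind of rule-interaction error the harder tier is described as exposing.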