PricingLogic: Evaluating LLMs Reasoning on Complex Tourism Pricing Tasks

📅 2025-10-14
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses financial and trust risks arising from multi-rule interactions in tourism pricing. We introduce PricingLogic, the first benchmark specifically designed for evaluating large language models (LLMs) on complex fare logic. Built upon 42 real-world pricing policies, it comprises 300 natural-language questions spanning two difficulty tiers: customer-type-based pricing and combinatorial discounting. Evaluation employs multi-stage prompting and fine-grained reasoning assessment to rigorously test LLMs’ rule parsing and numerical reasoning capabilities. Experiments systematically reveal structural deficiencies in current LLMs—particularly pronounced performance degradation on higher-order compositional tasks. To our knowledge, PricingLogic is the first domain-specific, high-stakes evaluation benchmark for tourism pricing. Beyond benchmarking, this work advances research in domain adaptation, model interpretability, and safety mechanisms for revenue-critical AI deployments, providing both critical risk awareness and methodological foundations for trustworthy pricing automation.

📝 Abstract
We present PricingLogic, the first benchmark that probes whether Large Language Models (LLMs) can reliably automate tourism-related pricing when multiple, overlapping fare rules apply. Travel agencies are eager to offload this error-prone task onto AI systems; however, deploying LLMs without verified reliability could result in significant financial losses and erode customer trust. PricingLogic comprises 300 natural-language questions based on booking requests derived from 42 real-world pricing policies, spanning two levels of difficulty: (i) basic customer-type pricing and (ii) bundled-tour calculations involving interacting discounts. Evaluations of a line of LLMs reveal a steep performance drop on the harder tier, exposing systematic failures in rule interpretation and arithmetic reasoning. These results highlight that, despite their general capabilities, today's LLMs remain unreliable in revenue-critical applications without further safeguards or domain adaptation. Our code and dataset are available at https://github.com/EIT-NLP/PricingLogic.
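To make the two difficulty tiers concrete, here is a minimal sketch of the kind of multi-rule arithmetic the benchmark targets. This is a hypothetical illustration, not an item from the PricingLogic dataset: the prices, customer types, and group-discount rule are invented for exposition.

```python
# Toy example of tiered tourism pricing: customer-type base prices
# (tier one) plus an interacting group discount (tier two).
# All names and numbers here are illustrative assumptions.

BASE_PRICES = {"adult": 100.0, "child": 60.0, "senior": 80.0}

def tour_price(party, group_discount=0.10, min_group_size=5):
    """Sum customer-type prices, then apply a group discount on top.

    `party` is a list of customer-type strings, e.g. ["adult", "child"].
    """
    subtotal = sum(BASE_PRICES[p] for p in party)
    if len(party) >= min_group_size:
        # The group discount interacts with per-type pricing:
        # it applies to the already type-priced subtotal.
        subtotal *= 1 - group_discount
    return round(subtotal, 2)

# 3 adults + 2 children + 1 senior: 300 + 120 + 80 = 500, then 10% off.
print(tour_price(["adult"] * 3 + ["child"] * 2 + ["senior"]))  # 450.0
```

Even this two-rule toy shows where models slip in the harder tier: the correct answer depends not only on each rule's arithmetic but on the order and scope in which rules compose.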
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLMs' ability to automate complex tourism pricing tasks
Testing reliability when multiple overlapping fare rules apply
Assessing systematic failures in rule interpretation and arithmetic reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

PricingLogic benchmark evaluates LLMs' reasoning on complex fare logic
Tests LLMs on 42 real-world tourism pricing policies across two difficulty tiers
Reveals performance drops in rule interpretation and arithmetic reasoning
Yunuo Liu
Hunan University
Dawei Zhu
Saarland University, Saarland Informatics Campus
Zena Al-Khalili
Saarland University, Saarland Informatics Campus
Dai Cheng
Ningbo Key Laboratory of Spatial Intelligence and Digital, Derivative Institute of Digital Twin, EIT
Yanjun Chen
University of Illinois Urbana-Champaign
Human-Computer Interaction, Haptics
Dietrich Klakow
Saarland University, Saarland Informatics Campus, PharmaScienceHub
Natural Language Processing, Speech Processing, Question Answering, Machine Learning
Wei Zhang
Ningbo Key Laboratory of Spatial Intelligence and Digital, Derivative Institute of Digital Twin, EIT
Xiaoyu Shen
Eastern Institute of Technology, Ningbo
language model, multi-modal learning, reasoning