PricingLogic: Evaluating LLMs Reasoning on Complex Tourism Pricing Tasks

📅 2025-10-14
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses financial and trust risks arising from multi-rule interactions in tourism pricing. We introduce PricingLogic, the first benchmark specifically designed for evaluating large language models (LLMs) on complex fare logic. Built upon 42 real-world pricing policies, it comprises 300 natural-language questions spanning two difficulty tiers: customer-type-based pricing and combinatorial discounting. Evaluation employs multi-stage prompting and fine-grained reasoning assessment to rigorously test LLMs’ rule parsing and numerical reasoning capabilities. Experiments systematically reveal structural deficiencies in current LLMs—particularly pronounced performance degradation on higher-order compositional tasks. To our knowledge, PricingLogic is the first domain-specific, high-stakes evaluation benchmark for tourism pricing. Beyond benchmarking, this work advances research in domain adaptation, model interpretability, and safety mechanisms for revenue-critical AI deployments, providing both critical risk awareness and methodological foundations for trustworthy pricing automation.

📝 Abstract
We present PricingLogic, the first benchmark that probes whether Large Language Models (LLMs) can reliably automate tourism-related pricing when multiple, overlapping fare rules apply. Travel agencies are eager to offload this error-prone task onto AI systems; however, deploying LLMs without verified reliability could result in significant financial losses and erode customer trust. PricingLogic comprises 300 natural-language questions based on booking requests derived from 42 real-world pricing policies, spanning two levels of difficulty: (i) basic customer-type pricing and (ii) bundled-tour calculations involving interacting discounts. Evaluations of a line of LLMs reveal a steep performance drop on the harder tier, exposing systematic failures in rule interpretation and arithmetic reasoning. These results highlight that, despite their general capabilities, today's LLMs remain unreliable in revenue-critical applications without further safeguards or domain adaptation. Our code and dataset are available at https://github.com/EIT-NLP/PricingLogic.
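To make the two difficulty tiers concrete, here is a minimal sketch of the kind of multi-rule arithmetic the benchmark targets. This is a hypothetical illustration, not an item from the PricingLogic dataset: the prices, customer types, and group-discount rule are invented for exposition.

```python
# Toy example of tiered tourism pricing: customer-type base prices
# (tier one) plus an interacting group discount (tier two).
# All names and numbers here are illustrative assumptions.

BASE_PRICES = {"adult": 100.0, "child": 60.0, "senior": 80.0}

def tour_price(party, group_discount=0.10, min_group_size=5):
    """Sum customer-type prices, then apply a group discount on top.

    `party` is a list of customer-type strings, e.g. ["adult", "child"].
    """
    subtotal = sum(BASE_PRICES[p] for p in party)
    if len(party) >= min_group_size:
        # The group discount interacts with per-type pricing:
        # it applies to the already type-priced subtotal.
        subtotal *= 1 - group_discount
    return round(subtotal, 2)

# 3 adults + 2 children + 1 senior: 300 + 120 + 80 = 500, then 10% off.
print(tour_price(["adult"] * 3 + ["child"] * 2 + ["senior"]))  # 450.0
```

Even this two-rule toy shows where models slip in the harder tier: the correct answer depends not only on each rule's arithmetic but on the order and scope in which rules compose.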
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLMs' ability to automate complex tourism pricing tasks
Testing reliability when multiple overlapping fare rules apply
Assessing systematic failures in rule interpretation and arithmetic reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

PricingLogic benchmark evaluates LLMs' reasoning on complex fare logic
Tests LLMs on 42 real-world tourism pricing policies across two difficulty tiers
Reveals performance drops in rule interpretation and arithmetic reasoning
Yunuo Liu
Hunan University
Dawei Zhu
Saarland University, Saarland Informatics Campus
Zena Al-Khalili
Saarland University, Saarland Informatics Campus
Dai Cheng
Ningbo Key Laboratory of Spatial Intelligence and Digital, Derivative Institute of Digital Twin, EIT
Yanjun Chen
University of Illinois Urbana-Champaign
Human-Computer Interaction, Haptics
Dietrich Klakow
Saarland University, Saarland Informatics Campus, PharmaScienceHub
Natural Language Processing, Speech Processing, Question Answering, Machine Learning
Wei Zhang
Ningbo Key Laboratory of Spatial Intelligence and Digital, Derivative Institute of Digital Twin, EIT
Xiaoyu Shen
Eastern Institute of Technology, Ningbo
language model, multi-modal learning, reasoning