LETToT: Label-Free Evaluation of Large Language Models On Tourism Using Expert Tree-of-Thought

📅 2025-08-15
📈 Citations: 0
Influential: 0
🤖 AI Summary
Evaluating large language models (LLMs) in tourism faces two key challenges: high annotation costs and hallucination-induced noise. To address these, we propose LETToT—a novel, annotation-free, scalable, domain-specific evaluation framework. LETToT introduces a hierarchically structured, expert-designed, and iteratively refined “thought tree” that aligns generic quality dimensions (e.g., accuracy, conciseness) with domain-expert feedback. It enables systematic analysis of how model scale and reasoning capabilities jointly affect output quality. Experiments demonstrate that reasoning-augmented smaller models (≤72B parameters) significantly outperform same-scale baselines in both accuracy and conciseness (p < 0.05), achieving quality improvements of 4.99%–14.15%. Furthermore, we empirically validate the applicability of scaling laws to specialized domains. LETToT establishes an efficient, reliable, and interpretable paradigm for vertical-domain LLM evaluation, advancing beyond reliance on costly human annotations or generic benchmarks.

📝 Abstract
Evaluating large language models (LLMs) in specific domains like tourism remains challenging due to the prohibitive cost of annotated benchmarks and persistent issues such as hallucinations. We propose $\textbf{L}$abel-Free $\textbf{E}$valuation of LLMs on $\textbf{T}$ourism using Expert $\textbf{T}$ree-$\textbf{o}$f-$\textbf{T}$hought (LETToT), a framework that leverages expert-derived reasoning structures, instead of labeled data, to assess LLMs in tourism. First, we iteratively refine and validate hierarchical ToT components through alignment with generic quality dimensions and expert feedback. Results demonstrate the effectiveness of our systematically optimized expert ToT, with 4.99%–14.15% relative quality gains over baselines. Second, we apply LETToT's optimized expert ToT to evaluate models of varying scales (32B–671B parameters), revealing that: (1) scaling laws persist in specialized domains (DeepSeek-V3 leads), yet reasoning-enhanced smaller models (e.g., DeepSeek-R1-Distill-Llama-70B) close the gap; (2) for sub-72B models, explicit reasoning architectures outperform their counterparts in accuracy and conciseness ($p<0.05$). Our work establishes a scalable, label-free paradigm for domain-specific LLM evaluation, offering a robust alternative to conventional annotated benchmarks.
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLMs in tourism without labeled data
Overcoming high costs and hallucinations in domain-specific LLM assessment
Enhancing model performance via expert Tree-of-Thought reasoning structures
Innovation

Methods, ideas, or system contributions that make the work stand out.

Expert Tree-of-Thought replaces labeled data
Hierarchical ToT components refined iteratively
Scalable label-free LLM evaluation paradigm
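The core idea of a hierarchical expert thought tree can be illustrated with a minimal sketch: quality dimensions form internal nodes, refined sub-criteria form leaves scored by a judge, and scores are aggregated bottom-up. This is not the paper's implementation; the class, weights, and example tree below are hypothetical.

```python
# Minimal sketch (not LETToT's actual code) of scoring an answer
# against a hierarchical "thought tree" of quality dimensions.
# All names (ThoughtNode, the weights, the example tree) are illustrative.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class ThoughtNode:
    name: str
    weight: float = 1.0              # relative importance among siblings
    score: Optional[float] = None    # leaf score in [0, 1], e.g. from an LLM judge
    children: List["ThoughtNode"] = field(default_factory=list)

def aggregate(node: ThoughtNode) -> float:
    """Bottom-up weighted average: leaves hold judge scores,
    internal nodes combine their children's aggregates."""
    if not node.children:
        return node.score if node.score is not None else 0.0
    total_weight = sum(c.weight for c in node.children)
    return sum(c.weight * aggregate(c) for c in node.children) / total_weight

# Toy tree: generic dimensions (accuracy, conciseness) refined
# into expert sub-criteria, as the paper's iterative process suggests.
tree = ThoughtNode("tourism answer quality", children=[
    ThoughtNode("accuracy", weight=2.0, children=[
        ThoughtNode("factual correctness", score=0.9),
        ThoughtNode("up-to-date details", score=0.7),
    ]),
    ThoughtNode("conciseness", weight=1.0, score=0.8),
])
print(round(aggregate(tree), 3))
```

Because only the tree structure and leaf judgments are needed, no labeled reference answers are required, which is what makes the paradigm label-free and scalable.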