🤖 AI Summary
Evaluating large language models (LLMs) in tourism faces two key challenges: high annotation costs and hallucination-induced noise. To address these, we propose LETToT—a novel, annotation-free, scalable, domain-specific evaluation framework. LETToT introduces a hierarchically structured, expert-designed, and iteratively refined “thought tree” that aligns generic quality dimensions (e.g., accuracy, conciseness) with domain-expert feedback. It enables systematic analysis of how model scale and reasoning capabilities jointly affect output quality. Experiments demonstrate that reasoning-augmented smaller models (≤72B parameters) significantly outperform same-scale baselines in both accuracy and conciseness (p < 0.05), achieving quality improvements of 4.99%–14.15%. Furthermore, we empirically validate the applicability of scaling laws to specialized domains. LETToT establishes an efficient, reliable, and interpretable paradigm for vertical-domain LLM evaluation, advancing beyond reliance on costly human annotations or generic benchmarks.
📝 Abstract
Evaluating large language models (LLMs) in specific domains like tourism remains challenging due to the prohibitive cost of annotated benchmarks and persistent issues like hallucinations. We propose $\textbf{L}$abel-Free $\textbf{E}$valuation of LLM on $\textbf{T}$ourism using Expert $\textbf{T}$ree-$\textbf{o}$f-$\textbf{T}$hought (LETToT), a framework that leverages expert-derived reasoning structures, instead of labeled data, to assess LLMs in tourism. First, we iteratively refine and validate hierarchical ToT components through alignment with generic quality dimensions and expert feedback. Results demonstrate the effectiveness of our systematically optimized expert ToT, with 4.99-14.15% relative quality gains over baselines. Second, we apply LETToT's optimized expert ToT to evaluate models of varying scales (32B-671B parameters), revealing: (1) Scaling laws persist in specialized domains (DeepSeek-V3 leads), yet reasoning-enhanced smaller models (e.g., DeepSeek-R1-Distill-Llama-70B) close this gap; (2) For sub-72B models, explicit reasoning architectures outperform counterparts in accuracy and conciseness ($p<0.05$). Our work establishes a scalable, label-free paradigm for domain-specific LLM evaluation, offering a robust alternative to conventional annotated benchmarks.
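To make the label-free idea concrete, below is a minimal Python sketch of how a hierarchical expert thought tree might score an answer against generic quality dimensions without gold labels. The class `ThoughtNode`, the `keyword_judge` stand-in, and the specific checks are illustrative assumptions, not the paper's actual implementation; in LETToT the judge and the tree contents come from the optimized expert ToT described in the abstract.

```python
# Hypothetical sketch of an expert thought-tree evaluator (not the authors' code).
# Each node poses a natural-language check about a candidate answer; a judge
# answers each check, and verdicts are averaged over the tree, so no gold
# labels are required.

from dataclasses import dataclass, field
from typing import Callable, List

# A judge is any callable mapping (question, answer, check) -> bool,
# e.g. an LLM-as-judge call in practice.
Judge = Callable[[str, str, str], bool]

@dataclass
class ThoughtNode:
    check: str                                   # natural-language criterion
    dimension: str                                # e.g. "accuracy" or "conciseness"
    children: List["ThoughtNode"] = field(default_factory=list)

    def score(self, question: str, answer: str, judge: Judge) -> float:
        """Average the judge's verdicts over this node and all descendants."""
        verdicts = [float(judge(question, answer, self.check))]
        verdicts += [c.score(question, answer, judge) for c in self.children]
        return sum(verdicts) / len(verdicts)

# Toy expert tree for one tourism question (structure is illustrative only).
tree = ThoughtNode(
    check="Does the answer address the traveller's question?",
    dimension="accuracy",
    children=[
        ThoughtNode("Are facts such as hours or prices stated without invented details?", "accuracy"),
        ThoughtNode("Is the answer free of redundant or repeated advice?", "conciseness"),
    ],
)

def keyword_judge(question: str, answer: str, check: str) -> bool:
    # Stand-in for an LLM judge: here we only require a non-empty answer.
    return bool(answer.strip())

if __name__ == "__main__":
    q = "What is the best time to visit the Forbidden City?"
    a = "Early morning on weekdays; buy tickets online the day before."
    print(f"tree score: {tree.score(q, a, keyword_judge):.2f}")
```

Grouping node verdicts by `dimension` would yield the per-dimension scores (accuracy, conciseness) that the reported comparisons are based on.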