🤖 AI Summary
Existing evaluation methods struggle to balance multiple objectives, such as relevance, diversity, sustainability, and popularity, when human annotation is costly, and no integrated assessment framework exists for sustainable urban travel recommendations. This work proposes a three-stage calibration framework that, for the first time, combines a multi-model LLM-as-a-Judge approach with human expert feedback, dimension-specific rules, and few-shot examples to identify and correct systematic biases in large language model (LLM) evaluations. Experiments reveal model-specific systematic biases and high variance across LLMs on individual dimensions. Calibration makes the judges' per-dimension reasoning clearer and more coherent, but it also exposes divergent subjective interpretations, especially of sustainability. The result is a more interpretable and transparent multidimensional evaluation of recommendations.
📝 Abstract
Evaluating nuanced conversational travel recommendations is challenging when human annotations are costly and standard metrics ignore stakeholder-centric goals. We study LLMs-as-Judges for sustainable city-trip lists across four dimensions (relevance, diversity, sustainability, and popularity balance) and propose a three-phase calibration framework: (1) baseline judging with multiple LLMs, (2) expert evaluation to identify systematic misalignment, and (3) dimension-specific calibration via rules and few-shot examples. Across two recommendation settings, we observe model-specific biases and high dimension-level variance, even when judges agree on overall rankings. Calibration clarifies reasoning per dimension but exposes divergent interpretations of sustainability, highlighting the need for transparent, bias-aware LLM evaluation. Prompts and code are released for reproducibility: https://github.com/ashmibanerjee/trs-llm-calibration.
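To make phase (2) concrete, the following is a minimal sketch of how per-dimension misalignment between LLM judges and experts might be quantified. It is not taken from the released repository: the function names, the 1-5 score scale, the judge labels, and all scores are hypothetical, and the paper may measure bias and variance differently.

```python
from statistics import mean, pvariance

# The four dimensions named in the abstract; the key spellings are an assumption.
DIMENSIONS = ("relevance", "diversity", "sustainability", "popularity_balance")


def dimension_bias(judge_scores, expert_scores):
    """Mean signed difference (judge minus expert) per dimension.

    Both arguments are lists of dicts, one dict per recommendation list,
    aligned by position. A positive value means the judge scores higher
    than the expert on that dimension.
    """
    return {
        dim: mean(j[dim] - e[dim] for j, e in zip(judge_scores, expert_scores))
        for dim in DIMENSIONS
    }


def cross_judge_variance(scores_by_judge):
    """Population variance, per dimension, of each judge's mean score."""
    return {
        dim: pvariance(
            [mean(item[dim] for item in items) for items in scores_by_judge.values()]
        )
        for dim in DIMENSIONS
    }


# Hypothetical scores on an assumed 1-5 scale for two recommendation lists.
expert = [
    {"relevance": 4, "diversity": 3, "sustainability": 2, "popularity_balance": 3},
    {"relevance": 5, "diversity": 4, "sustainability": 3, "popularity_balance": 2},
]
judges = {
    "judge_a": [  # over-scores sustainability in this toy data
        {"relevance": 4, "diversity": 3, "sustainability": 4, "popularity_balance": 3},
        {"relevance": 5, "diversity": 4, "sustainability": 5, "popularity_balance": 2},
    ],
    "judge_b": [  # under-scores relevance in this toy data
        {"relevance": 3, "diversity": 3, "sustainability": 2, "popularity_balance": 3},
        {"relevance": 4, "diversity": 4, "sustainability": 3, "popularity_balance": 2},
    ],
}

bias = {name: dimension_bias(scores, expert) for name, scores in judges.items()}
variance = cross_judge_variance(judges)
```

In this toy data each judge deviates from the expert on a different dimension, the kind of model-specific pattern that phase (3) would then target with dimension-specific rules and few-shot examples.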