SCORE: Specificity, Context Utilization, Robustness, and Relevance for Reference-Free LLM Evaluation

📅 2026-02-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses the limitations of existing evaluation methods for large language models in assessing whether outputs in high-stakes, domain-specific scenarios contain the fine-grained information necessary for informed decision-making. To this end, the work proposes a reference-free, four-dimensional evaluation framework that measures the quality of domain-specific question answering along the axes of specificity, context utilization, robustness, and relevance. The authors construct a curated dataset of 1,412 question-answer pairs spanning 40 professional roles and seven types of natural hazards, and combine automatic metrics with human evaluation. Robustness is probed with paraphrasing and semantic-perturbation tests, while the human study highlights the subjectivity inherent in open-ended, domain-specific assessment. The results show that no single metric captures answer quality on its own, underscoring the need for structured, multi-metric evaluation when judging the effectiveness of model outputs in critical decision-making contexts.
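As a rough illustration of the multi-metric idea, the sketch below combines per-dimension scores into a single reference-free quality score. The four dimension names come from the paper; the 0-1 score ranges, the equal weighting, and the aggregation function are illustrative assumptions, not the authors' actual method.

```python
# Hypothetical sketch: combining the four SCORE dimensions into one number.
# Dimension names follow the paper; weights and ranges are assumptions.
from dataclasses import dataclass

@dataclass
class ScoreReport:
    specificity: float          # fine-grained, decision-critical detail (0-1)
    context_utilization: float  # use of retrieved/provided context (0-1)
    robustness: float           # answer stability under paraphrased questions (0-1)
    relevance: float            # semantic relevance of answer to question (0-1)

def aggregate(report: ScoreReport, weights=(0.25, 0.25, 0.25, 0.25)) -> float:
    """Weighted mean over the four dimensions (equal weights assumed)."""
    dims = (report.specificity, report.context_utilization,
            report.robustness, report.relevance)
    return sum(w * d for w, d in zip(weights, dims))

# Example: a relevant but vague answer scores high on relevance, low on specificity.
print(aggregate(ScoreReport(specificity=0.3, context_utilization=0.6,
                            robustness=0.8, relevance=0.9)))  # -> 0.65
```

The point of keeping the dimensions separate before aggregating is that failure modes stay visible: the example answer above looks acceptable on average but is clearly under-specified, which is exactly the case a single-metric evaluation would miss.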

📝 Abstract
Large language models (LLMs) are increasingly used to support question answering and decision-making in high-stakes, domain-specific settings such as natural hazard response and infrastructure planning, where effective answers must convey fine-grained, decision-critical details. However, existing evaluation frameworks for retrieval-augmented generation (RAG) and open-ended question answering primarily rely on surface-level similarity, factual consistency, or semantic relevance, and often fail to assess whether responses provide the specific information required for domain-sensitive decisions. To address this gap, we propose a multi-dimensional, reference-free evaluation framework that assesses LLM outputs along four complementary dimensions: specificity, robustness to paraphrasing and semantic perturbations, answer relevance, and context utilization. We introduce a curated dataset of 1,412 domain-specific question-answer pairs spanning 40 professional roles and seven natural hazard types to support systematic evaluation. We further conduct human evaluation to assess inter-annotator agreement and alignment between model outputs and human judgments, which highlights the inherent subjectivity of open-ended, domain-specific evaluation. Our results show that no single metric sufficiently captures answer quality in isolation and demonstrate the need for structured, multi-metric evaluation frameworks when deploying LLMs in high-stakes applications.
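The robustness dimension described in the abstract (stability under paraphrasing and semantic perturbations) can be pictured with a minimal sketch like the one below. The `ask_model` callable, the paraphrase list, and the token-overlap (Jaccard) similarity are stand-ins for whatever model interface, perturbation generator, and semantic measure the paper actually uses.

```python
# Minimal paraphrase-robustness check: ask the same question in several reworded
# forms and measure how consistent the answers are. All names here are
# placeholders, not the paper's implementation.
from typing import Callable, List
from itertools import combinations

def jaccard(a: str, b: str) -> float:
    """Crude lexical similarity between two answers (stand-in for a semantic metric)."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 1.0

def robustness_score(question_variants: List[str],
                     ask_model: Callable[[str], str]) -> float:
    """Mean pairwise similarity of answers across paraphrased questions."""
    answers = [ask_model(q) for q in question_variants]
    pairs = list(combinations(answers, 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)

# Usage with a stub model that ignores wording differences (score -> 1.0):
variants = [
    "What supplies should a hospital stockpile before a hurricane?",
    "Which supplies should a hospital keep on hand ahead of a hurricane?",
]
print(robustness_score(variants, lambda q: "water, fuel, backup generators, medication"))
```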
Problem

Research questions and friction points this paper is trying to address.

reference-free evaluation
domain-specific question answering
retrieval-augmented generation
answer specificity
high-stakes LLM applications
Innovation

Methods, ideas, or system contributions that make the work stand out.

reference-free evaluation
multi-dimensional assessment
domain-specific LLM evaluation
robustness to paraphrasing
context utilization