Exploring the features used for summary evaluation by Human and GPT

πŸ“… 2025-12-22
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
This work addresses the lack of interpretability and unclear mapping between large language models (LLMs) such as GPT and human judgments in summary evaluation. We systematically disentangle the semantic and statistical features underlying human and LLM assessments across dimensions including coherence and faithfulness. Methodologically, we integrate statistical metrics (e.g., ROUGE, BERTScore), explainable AI techniques (SHAP, LIME), and prompt-engineering-driven instruction tuning of GPT. We identify, for the first time, six critical evaluation features and propose a novel paradigmβ€”β€œhuman-metric-guided GPT evaluation.” Experiments demonstrate that our approach improves the average Pearson correlation between GPT-based scores and human ratings by 23.7%, while exhibiting strong cross-dataset generalization. This significantly enhances both the reliability and interpretability of LLMs as automated evaluators.

Technology Category

Application Category

πŸ“ Abstract
Summary assessment involves evaluating how well a generated summary reflects the key ideas and meaning of the source text, requiring a deep understanding of the content. Large Language Models (LLMs) have been used to automate this process, acting as judges to evaluate summaries with respect to the original text. While previous research investigated the alignment between LLMs and Human responses, it is not yet well understood what properties or features are exploited by them when asked to evaluate based on a particular quality dimension, and there has not been much attention towards mapping between evaluation scores and metrics. In this paper, we address this issue and discover features aligned with Human and Generative Pre-trained Transformers (GPTs) responses by studying statistical and machine learning metrics. Furthermore, we show that instructing GPTs to employ metrics used by Human can improve their judgment and conforming them better with human responses.
Problem

Research questions and friction points this paper is trying to address.

Identifies features used by humans and GPTs for summary evaluation
Maps evaluation scores to statistical and machine learning metrics
Improves GPT judgment by aligning with human evaluation metrics
Innovation

Methods, ideas, or system contributions that make the work stand out.

Using statistical and machine learning metrics to discover evaluation features
Instructing GPTs to employ human-used metrics for better judgment
Mapping evaluation scores to metrics for improved alignment with humans
πŸ”Ž Similar Papers
No similar papers found.