Judge Like Human Examiners: A Weighted Importance Multi-Point Evaluation Framework for Generative Tasks with Long-form Answers

📅 2026-04-13
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work addresses two key challenges in evaluating long-form text generation: the difficulty of assessing whether a response is genuinely grounded in the provided context, and the inability of existing metrics to account for the varying importance of different elements of a reference answer. To tackle these issues, the authors propose the Weighted Importance Multi-Point Evaluation (WIMPE) framework, inspired by how human examiners grade. WIMPE factorizes each reference answer into context-bound scoring points, each carrying an explicit weight, and defines two complementary metrics: Weighted Point-wise Alignment (WPA), which measures how well a response aligns with the reference answer, and Point-wise Conflict Penalty (PCP), which measures how much it contradicts the reference answer. Experiments on 10 generative tasks show that WIMPE correlates more strongly with human annotations than existing evaluation methods.
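The summary does not give the formulas behind WPA and PCP, so the following is only a minimal sketch of how weighted, point-level scoring of this kind could work, not the authors' implementation. Everything in it is an assumption made for illustration: the `ScoringPoint` and `Verdict` types, the two function names, the normalization by total weight, and the hard-coded verdicts (in practice a human or model judge would label each point).

```python
from dataclasses import dataclass
from enum import Enum
from typing import Sequence


class Verdict(Enum):
    """Hypothetical per-point judgment of a response against one scoring point."""
    SUPPORTED = "supported"        # the response covers the point
    CONTRADICTED = "contradicted"  # the response conflicts with the point
    MISSING = "missing"            # the response does not mention the point


@dataclass(frozen=True)
class ScoringPoint:
    """Hypothetical scoring point: a piece of the reference answer plus its weight."""
    text: str
    weight: float


def _weighted_share(points: Sequence[ScoringPoint],
                    verdicts: Sequence[Verdict],
                    target: Verdict) -> float:
    """Fraction of the total weight held by points whose verdict equals `target`."""
    if len(points) != len(verdicts):
        raise ValueError("need exactly one verdict per scoring point")
    total = sum(p.weight for p in points)
    if total <= 0:
        raise ValueError("total weight must be positive")
    hit = sum(p.weight for p, v in zip(points, verdicts) if v is target)
    return hit / total


def weighted_alignment(points: Sequence[ScoringPoint],
                       verdicts: Sequence[Verdict]) -> float:
    """Assumed WPA-style score: weighted share of supported points, in [0, 1]."""
    return _weighted_share(points, verdicts, Verdict.SUPPORTED)


def conflict_penalty(points: Sequence[ScoringPoint],
                     verdicts: Sequence[Verdict]) -> float:
    """Assumed PCP-style score: weighted share of contradicted points, in [0, 1]."""
    return _weighted_share(points, verdicts, Verdict.CONTRADICTED)


if __name__ == "__main__":
    # Made-up reference answer split into three points of decreasing importance.
    points = [
        ScoringPoint("main cause stated in the context", 3.0),
        ScoringPoint("supporting figure from the context", 2.0),
        ScoringPoint("minor caveat", 1.0),
    ]
    # Made-up verdicts standing in for a judge's output.
    verdicts = [Verdict.SUPPORTED, Verdict.CONTRADICTED, Verdict.MISSING]

    print(f"alignment: {weighted_alignment(points, verdicts):.2f}")  # 0.50
    print(f"conflict:  {conflict_penalty(points, verdicts):.2f}")    # 0.33
```

Keeping the two scores separate, rather than folding them into one number, mirrors the summary's description of alignment and contradiction as complementary signals; how the framework actually combines or reports them is not stated here.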

📝 Abstract
Evaluating the quality of model responses remains challenging in generative tasks with long-form answers, as the expected answers usually contain multiple semantically distinct yet complementary factors that should be factorized for fine-grained assessment. Recent evaluation methods rely on either task-level rubrics or question-aware checklists. However, they still (1) struggle to assess whether a response is genuinely grounded in the provided contexts, and (2) fail to capture the heterogeneous importance of different aspects of reference answers. Inspired by human examiners, we propose a Weighted Importance Multi-Point Evaluation (WIMPE) framework, which factorizes each reference answer into weighted context-bound scoring points. Two complementary metrics, namely Weighted Point-wise Alignment (WPA) and Point-wise Conflict Penalty (PCP), are designed to measure the alignment and contradiction between model responses and reference answers. Extensive experiments on 10 generative tasks demonstrate that WIMPE achieves higher correlations with human annotations.

Problem

Research questions and friction points this paper is trying to address.

generative tasks
long-form answers
evaluation framework
answer quality
fine-grained assessment

Innovation

Methods, ideas, or system contributions that make the work stand out.

Weighted Importance Multi-Point Evaluation
context-bound scoring points
Weighted Point-wise Alignment
Point-wise Conflict Penalty
fine-grained evaluation