LLM-ReSum: A Framework for LLM Reflective Summarization through Self-Evaluation

📅 2026-04-28

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the poor alignment between automatic evaluation of large language model (LLM)-generated summaries and human judgments, particularly across heterogeneous domains and document lengths. The authors run a large-scale meta-evaluation to identify the automatic evaluators that align best with human ratings, then propose LLM-ReSum, a fine-tuning-free, closed-loop self-reflective summarization framework that couples LLM-based self-evaluation with summary generation and iteratively refines the output through feedback. They also introduce PatentSumEval, a new human-annotated benchmark for legal document summarization. Across three domains, LLM-ReSum improves low-quality summaries by up to 33% in factual accuracy and up to 39% in coverage, and human evaluators prefer the refined summaries in 89% of cases.
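As a rough illustration of what such a meta-evaluation involves, the sketch below ranks candidate metrics by how well their scores correlate with human ratings on the same summaries. It assumes Spearman rank correlation as the agreement measure; the summary above does not say which correlation statistic the authors use, and the metric names and scores are made-up toy values, not results from the paper.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        return 0.0  # assumption: treat a constant score list as uncorrelated
    return cov / (var_x * var_y) ** 0.5


def rank_metrics(metric_scores, human_scores):
    """Sort metrics from most to least aligned with the human ratings."""
    scored = [(name, spearman(scores, human_scores))
              for name, scores in metric_scores.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy data (hypothetical): five summaries rated by humans and two metrics.
human_ratings = [1, 2, 3, 4, 5]
toy_metric_scores = {
    "metric_a": [0.1, 0.2, 0.3, 0.4, 0.5],  # agrees with the human ordering
    "metric_b": [0.5, 0.4, 0.3, 0.2, 0.1],  # reverses the human ordering
}
ranking = rank_metrics(toy_metric_scores, human_ratings)
```

Here `ranking` lists `metric_a` first, since its ordering of the five summaries matches the human ordering while `metric_b` inverts it.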

📝 Abstract

Reliable evaluation of large language model (LLM)-generated summaries remains an open challenge, particularly across heterogeneous domains and document lengths. We conduct a comprehensive meta-evaluation of 14 automatic summarization metrics and LLM-based evaluators across seven datasets spanning five domains, covering documents from short news articles to long scientific, governmental, and legal texts (2K-27K words) with over 1,500 human-annotated summaries. Our results show that traditional lexical overlap metrics (e.g., ROUGE, BLEU) exhibit weak or negative correlation with human judgments, while task-specific neural metrics and LLM-based evaluators achieve substantially higher alignment, especially for linguistic quality assessment. Leveraging these findings, we propose LLM-ReSum, a self-reflective summarization framework that integrates LLM-based evaluation and generation in a closed feedback loop without model fine-tuning. Across three domains, LLM-ReSum improves low-quality summaries by up to 33% in factual accuracy and 39% in coverage, with human evaluators preferring refined summaries in 89% of cases. We additionally introduce PatentSumEval, a new human-annotated benchmark for legal document summarization comprising 180 expert-evaluated summaries. All code and datasets will be released on GitHub.
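The closed feedback loop described in the abstract can be pictured with a minimal sketch like the one below. This is not the authors' implementation: the `generate` and `evaluate` callables, the `Evaluation` type, the score threshold, and the round limit are all hypothetical names and choices introduced for illustration, and the toy stand-ins replace real LLM calls so the example runs without any model.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class Evaluation:
    score: float   # hypothetical quality score in [0, 1]
    feedback: str  # evaluator critique passed back to the generator


def reflective_summarize(
    document: str,
    generate: Callable[[str, Optional[str], Optional[str]], str],
    evaluate: Callable[[str, str], Evaluation],
    threshold: float = 0.8,  # assumed stopping criterion
    max_rounds: int = 3,     # assumed refinement budget
) -> Tuple[str, Evaluation]:
    """Generate, evaluate, and refine until the score is good enough."""
    summary = generate(document, None, None)
    evaluation = evaluate(document, summary)
    best = (summary, evaluation)
    for _ in range(max_rounds):
        if evaluation.score >= threshold:
            break
        # Feed the critique back into generation; no model weights change.
        summary = generate(document, summary, evaluation.feedback)
        evaluation = evaluate(document, summary)
        if evaluation.score > best[1].score:
            best = (summary, evaluation)
    return best


# Toy stand-ins for the LLM calls (assumptions, for demonstration only).
def toy_generate(document, previous, feedback):
    """Cover one more source sentence per round; ignores the feedback text."""
    sentences = document.split(". ")
    count = 1 if previous is None else len(previous.split(". ")) + 1
    return ". ".join(sentences[:count])


def toy_evaluate(document, summary):
    """Score coverage as the fraction of source sentences in the summary."""
    covered = len(summary.split(". "))
    total = len(document.split(". "))
    return Evaluation(score=covered / total, feedback="Cover more of the source.")


toy_document = (
    "Claim one covers the rotor. Claim two covers the housing. "
    "Claim three covers the seal. Claim four covers the assembly"
)
final_summary, final_evaluation = reflective_summarize(
    toy_document, toy_generate, toy_evaluate
)
```

Keeping the best-scoring candidate rather than the last one guards against a refinement round that makes the summary worse, which is a design choice of this sketch and not something the abstract specifies.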

Problem

Research questions and friction points this paper is trying to address.

LLM summarization

summary evaluation

heterogeneous domains

automatic metrics

human judgment alignment

Innovation

Methods, ideas, or system contributions that make the work stand out.

LLM-based evaluation

reflective summarization

self-evaluation

closed-loop feedback

PatentSumEval

🔎 Similar Papers

LitLLM: A Toolkit for Scientific Literature Review

2024-02-02, arXiv.org, Citations: 17