🤖 AI Summary
Large language models (LLMs) often generate meeting summaries that suffer from hallucination, omission, and irrelevance. To address these issues, this paper proposes a semantic-enhanced summarization framework with two core components: (1) FRAME, a modular pipeline that extracts and scores salient facts, clusters them by topic, and enriches an outline into an abstractive summary; and (2) SCOPE, a reason-out-loud protocol that builds a reasoning trace by answering nine questions before content selection, adapting the summary to its reader. The paper also introduces P-MESA, a reference-free, multi-dimensional evaluation framework that assesses whether a summary fits a target reader. P-MESA achieves at least 89% balanced accuracy against human annotations and aligns strongly with human severity ratings (r ≥ 0.70). On QMSum and FAME, FRAME reduces hallucination and omission by 2 out of 5 points (measured with MESA), while SCOPE improves knowledge fit and goal alignment over prompt-only baselines, improving both summary quality and individual relevance.
📝 Abstract
Meeting summarization with large language models (LLMs) remains error-prone, often producing outputs with hallucinations, omissions, and irrelevancies. We present FRAME, a modular pipeline that reframes summarization as a semantic enrichment task. FRAME extracts and scores salient facts, organizes them thematically, and uses them to enrich an outline into an abstractive summary. To personalize summaries, we introduce SCOPE, a reason-out-loud protocol that has the model build a reasoning trace by answering nine questions before content selection. For evaluation, we propose P-MESA, a multi-dimensional, reference-free evaluation framework to assess whether a summary fits a target reader. P-MESA reliably identifies error instances, achieving ≥ 89% balanced accuracy against human annotations and aligning strongly with human severity ratings (r ≥ 0.70). On QMSum and FAME, FRAME reduces hallucination and omission by 2 out of 5 points (measured with MESA), while SCOPE improves knowledge fit and goal alignment over prompt-only baselines. Our findings advocate for rethinking summarization to improve control, faithfulness, and personalization.
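The FRAME stages described above (fact extraction, importance scoring, thematic organization, outline enrichment) can be sketched as a simple pipeline. This is an illustrative sketch only: every function name and heuristic below is a placeholder, not the paper's implementation, which relies on LLM prompting rather than the toy keyword scoring used here.

```python
from collections import defaultdict

# Hypothetical sketch of FRAME's pipeline shape: extract facts,
# score their importance, group them by topic, then keep the
# top-scoring facts per topic as the enriched outline.
# All heuristics are placeholders, not the paper's method.

def extract_facts(transcript):
    """Treat each non-empty sentence as a candidate fact (placeholder)."""
    return [s.strip() for s in transcript.split(".") if s.strip()]

def score_importance(fact, keywords):
    """Toy importance score: overlap with salient keywords (placeholder)."""
    return len(set(fact.lower().split()) & keywords)

def group_by_topic(facts, topic_of):
    """Group facts under topics via a caller-supplied labeling function."""
    groups = defaultdict(list)
    for fact in facts:
        groups[topic_of(fact)].append(fact)
    return groups

def frame_summarize(transcript, keywords, topic_of, top_k=2):
    """Enrich a topic outline with the highest-scoring facts."""
    facts = extract_facts(transcript)
    ranked = sorted(facts,
                    key=lambda f: score_importance(f, keywords),
                    reverse=True)
    outline = group_by_topic(ranked, topic_of)
    # Keep only the top_k most important facts per topic.
    return {topic: topic_facts[:top_k]
            for topic, topic_facts in outline.items()}
```

In the actual framework, each stage would be realized by an LLM call (and SCOPE's nine-question reasoning trace would precede content selection); the value of the modular shape is that each stage can be evaluated and swapped independently.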