Rethinking Scientific Summarization Evaluation: Grounding Explainable Metrics on Facet-aware Benchmark

📅 2024-02-22
🏛️ arXiv.org
📈 Citations: 2
Influential: 0
🤖 AI Summary
Existing scientific summarization evaluation methods—such as n-gram overlap, embedding similarity, and QA-based matching—lack interpretability and domain adaptability, failing to capture scientific concept understanding and coverage of key conceptual facets. To address this, we propose the first fine-grained, multi-faceted evaluation paradigm tailored to scientific literature, introducing the Facet-aware Metric (FM) and its accompanying human-annotated Facet-based Dataset (FD). FM integrates large language model (LLM)-driven semantic matching, scientific facet disentanglement modeling, and small-model fine-tuning to enable concept-aware, dimension-wise, and interpretable assessment. Experiments demonstrate that FM significantly outperforms conventional metrics across multiple scientific domains. Moreover, small models fine-tuned on FD achieve performance on par with LLMs in scientific summarization evaluation, revealing inherent limitations of LLMs in scientific contextual learning. This work establishes a new foundation for rigorous, transparent, and domain-grounded evaluation of scientific summaries.
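The facet-aware idea above, decomposing evaluation into per-facet matching subtasks, can be illustrated with a toy sketch. The paper's FM uses LLM-driven semantic matching; here a simple bag-of-words cosine similarity stands in as a hypothetical proxy, and the facet names (background, method, result) are illustrative assumptions, not the paper's exact facet scheme.

```python
# Toy sketch of facet-level coverage scoring. Assumption: reference
# facets resemble background / method / result; the real FM replaces
# the cosine proxy below with LLM-based semantic matching.
from collections import Counter
import math

def cosine(a: str, b: str) -> float:
    """Cosine similarity between bag-of-words vectors of two strings."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    na = math.sqrt(sum(v * v for v in va.values()))
    nb = math.sqrt(sum(v * v for v in vb.values()))
    return dot / (na * nb) if na and nb else 0.0

def facet_coverage(reference_facets: dict, summary_sentences: list) -> dict:
    """For each reference facet, keep the best match score over all
    summary sentences; a high score means that facet is covered."""
    return {
        facet: max((cosine(ref_text, s) for s in summary_sentences), default=0.0)
        for facet, ref_text in reference_facets.items()
    }

# Illustrative inputs (not from the paper's dataset)
facets = {
    "background": "evaluation of scientific summaries lacks interpretability",
    "method": "facet aware metric uses llm semantic matching per facet",
    "result": "fine tuned small models rival llms on this task",
}
summary = [
    "the facet aware metric applies llm semantic matching per facet",
    "small fine tuned models rival llms",
]
scores = facet_coverage(facets, summary)
```

Because each facet is scored independently, the output is a per-dimension breakdown rather than a single opaque number, which is the interpretability gain the summary describes.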

📝 Abstract
The summarization capabilities of pretrained and large language models (LLMs) have been widely validated in general areas, but their use on scientific corpora, which involve complex sentences and specialized knowledge, has been less assessed. This paper presents conceptual and experimental analyses of scientific summarization, highlighting the inadequacies of traditional evaluation methods, such as n-gram, embedding comparison, and QA, particularly in providing explanations, grasping scientific concepts, or identifying key content. Subsequently, we introduce the Facet-aware Metric (FM), employing LLMs for advanced semantic matching to evaluate summaries based on different aspects. This facet-aware approach offers a thorough evaluation of abstracts by decomposing the evaluation task into simpler subtasks. Recognizing the absence of an evaluation benchmark in this domain, we curate a Facet-based scientific summarization Dataset (FD) with facet-level annotations. Our findings confirm that FM offers a more logical approach to evaluating scientific summaries. In addition, fine-tuned smaller models can compete with LLMs in scientific contexts, while LLMs have limitations in learning from in-context information in scientific domains. This suggests an area for future enhancement of LLMs.
Problem

Research questions and friction points this paper is trying to address.

Traditional methods for evaluating scientific summaries (n-gram, embedding comparison, QA) lack explanations and fail to grasp scientific concepts
How to perform advanced, facet-level semantic matching when judging summary quality
Where LLMs fall short in scientific contexts, and whether fine-tuned smaller models can compete
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces the Facet-aware Metric (FM) for interpretable, dimension-wise semantic evaluation
Employs LLMs for advanced facet-based semantic matching
Curates FD, a human-annotated facet-level benchmark for scientific summarization