Evaluating LLMs and Pre-trained Models for Text Summarization Across Diverse Datasets

📅 2025-02-26
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This study systematically evaluates the cross-domain abstractive and extractive summarization capabilities of BART, FLAN-T5, LLaMA-3-8B, and Gemma-7B across five heterogeneous benchmarks: CNN/DM, Gigaword, News Summary, XSum, and BBC News. To address the lack of standardized comparative analysis, we conduct the first unified benchmarking of open-source large language models against classical pre-trained models, rigorously assessing generalization and robustness while probing how architectural choices and pretraining objectives influence summarization paradigms. We employ a standardized evaluation pipeline built on Hugging Face Transformers, using ROUGE, BERTScore, and METEOR for multi-dimensional automatic assessment. Results show that LLaMA-3-8B achieves a ROUGE-L score of 38.2 on XSum—significantly outperforming BART in abstractive summarization—while FLAN-T5 attains a METEOR score of 42.7 on CNN/DM, demonstrating the efficacy of instruction tuning for controllable extractive generation. This work provides empirical guidance for model selection and optimization in summarization research.

Technology Category

Application Category

📝 Abstract
Text summarization plays a crucial role in natural language processing by condensing large volumes of text into concise and coherent summaries. As digital content continues to grow rapidly and the demand for effective information retrieval increases, text summarization has become a focal point of research in recent years. This study offers a thorough evaluation of four leading pre-trained and open-source large language models: BART, FLAN-T5, LLaMA-3-8B, and Gemma-7B, across five diverse datasets CNN/DM, Gigaword, News Summary, XSum, and BBC News. The evaluation employs widely recognized automatic metrics, including ROUGE-1, ROUGE-2, ROUGE-L, BERTScore, and METEOR, to assess the models' capabilities in generating coherent and informative summaries. The results reveal the comparative strengths and limitations of these models in processing various text types.
Problem

Research questions and friction points this paper is trying to address.

Evaluate LLMs for text summarization
Assess models across diverse datasets
Compare model strengths and limitations
Innovation

Methods, ideas, or system contributions that make the work stand out.

Evaluates pre-trained LLMs
Uses diverse datasets
Employs ROUGE, BERTScore metrics
🔎 Similar Papers
No similar papers found.