🤖 AI Summary
This study systematically evaluates the cross-domain abstractive and extractive summarization capabilities of BART, FLAN-T5, LLaMA-3-8B, and Gemma-7B across five heterogeneous benchmarks: CNN/DM, Gigaword, News Summary, XSum, and BBC News. To address the lack of standardized comparative analysis, we conduct the first unified benchmarking of open-source large language models against classical pre-trained models, rigorously assessing generalization and robustness while probing how architectural choices and pretraining objectives influence summarization paradigms. We employ a standardized evaluation pipeline built on Hugging Face Transformers, using ROUGE, BERTScore, and METEOR for multi-dimensional automatic assessment. Results show that LLaMA-3-8B achieves a ROUGE-L score of 38.2 on XSum—significantly outperforming BART in abstractive summarization—while FLAN-T5 attains a METEOR score of 42.7 on CNN/DM, demonstrating the efficacy of instruction tuning for controllable extractive generation. This work provides empirical guidance for model selection and optimization in summarization research.
📝 Abstract
Text summarization plays a crucial role in natural language processing by condensing large volumes of text into concise and coherent summaries. As digital content continues to grow rapidly and the demand for effective information retrieval increases, text summarization has become a focal point of research in recent years. This study offers a thorough evaluation of four leading pre-trained and open-source large language models: BART, FLAN-T5, LLaMA-3-8B, and Gemma-7B, across five diverse datasets CNN/DM, Gigaword, News Summary, XSum, and BBC News. The evaluation employs widely recognized automatic metrics, including ROUGE-1, ROUGE-2, ROUGE-L, BERTScore, and METEOR, to assess the models' capabilities in generating coherent and informative summaries. The results reveal the comparative strengths and limitations of these models in processing various text types.