Can GPT models Follow Human Summarization Guidelines? A Study for Targeted Communication Goals

📅 2023-10-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study investigates whether GPT-series models (ChatGPT, GPT-4, GPT-4o) can adhere to task-oriented, human-authored summarization guidelines on English social dialogues and French customer-service calls. Method: a human-centric, guideline-aligned evaluation framework in which human assessment is the primary measure, complemented by automatic metrics (ROUGE, BERTScore), applied to two existing dialogue summarization datasets (DialogSum and DECODA) using a range of guideline-oriented prompts. Contribution/Results: human judges prefer GPT-generated summaries over those of task-specific pre-trained summarizers and even over the reference summaries, indicating strong guideline compliance, although GPT outputs tend to be longer and diverge lexically and structurally from the references. The study also finds substantial misalignment between the automatic metrics and human judgments, underscoring the need for more reliable automatic evaluation in guideline-constrained summarization.
📝 Abstract
This study investigates the ability of GPT models (ChatGPT, GPT-4 and GPT-4o) to generate dialogue summaries that adhere to human guidelines. Our evaluation involved experimenting with various prompts to guide the models in complying with guidelines on two datasets: DialogSum (English social conversations) and DECODA (French call center interactions). Human evaluation, based on summarization guidelines, served as the primary assessment method, complemented by extensive quantitative and qualitative analyses. Our findings reveal a preference for GPT-generated summaries over those from task-specific pre-trained models and reference summaries, highlighting GPT models' ability to follow human guidelines despite occasionally producing longer outputs and exhibiting divergent lexical and structural alignment with references. The discrepancy between ROUGE, BERTScore, and human evaluation underscores the need for more reliable automatic evaluation metrics.
Problem

Research questions and friction points this paper is trying to address.

Assess GPT models' adherence to human summarization guidelines
Compare GPT performance on English and French dialogue datasets
Evaluate discrepancy between automatic metrics and human assessment
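The metric/human discrepancy raised above can be illustrated with a minimal sketch of ROUGE-1 F1 (a simplified unigram-overlap score, not the paper's actual evaluation pipeline; the example sentences are invented): a faithful but lexically divergent paraphrase scores far lower than a near-copy of the reference.

```python
from collections import Counter

def rouge1_f1(candidate: str, reference: str) -> float:
    """Simplified ROUGE-1 F1: unigram overlap between candidate and reference."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

# Invented toy example: a guideline-compliant paraphrase vs. a near-copy.
reference = "the caller asks the agent to cancel her subscription"
faithful_paraphrase = "a customer requests cancellation of her plan"
near_copy = "the caller asks the agent to cancel subscription"

print(rouge1_f1(faithful_paraphrase, reference))  # low, despite being faithful
print(rouge1_f1(near_copy, reference))            # high, rewards surface overlap
```

This is exactly the failure mode the paper points at: a lexical-overlap metric penalizes summaries that follow the guidelines but rephrase the content.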
Innovation

Methods, ideas, or system contributions that make the work stand out.

Prompt designs that steer GPT models toward compliance with human summarization guidelines
Guideline-based human evaluation as the primary assessment, beyond ROUGE/BERTScore
Bilingual evaluation across DialogSum (English social conversations) and DECODA (French call center)
Yongxin Zhou
Univ. Grenoble Alpes, CNRS, Inria, Grenoble INP, LIG, 38000 Grenoble, France
Fabien Ringeval
Associate Professor, Université Grenoble Alpes, LIG, France
Speech processing · Machine learning · Affective Computing · Atypical Communication
François Portet
Univ. Grenoble Alpes, CNRS, Inria, Grenoble INP, LIG, 38000 Grenoble, France