🤖 AI Summary
This study investigates whether GPT-series models (ChatGPT, GPT-4, GPT-4o) can adhere to human-authored summarization guidelines for English social dialogues and French call-center conversations. Method: a guideline-aligned evaluation framework in which human assessment, conducted against the summarization guidelines themselves, serves as the primary measure, complemented by automatic metrics (ROUGE, BERTScore). The models are steered toward guideline compliance through varied prompting strategies on two existing datasets, DialogSum (English) and DECODA (French). Contribution/Results: GPT-generated summaries are preferred by human evaluators over those from task-specific pre-trained summarizers and even over reference summaries, indicating strong guideline-following ability, although outputs tend to be longer and diverge lexically and structurally from references. The observed misalignment between ROUGE, BERTScore, and human judgments underscores the need for more reliable automatic evaluation metrics in guideline-constrained summarization.
📝 Abstract
This study investigates the ability of GPT models (ChatGPT, GPT-4 and GPT-4o) to generate dialogue summaries that adhere to human guidelines. Our evaluation involved experimenting with various prompts to guide the models in complying with guidelines on two datasets: DialogSum (English social conversations) and DECODA (French call center interactions). Human evaluation, based on summarization guidelines, served as the primary assessment method, complemented by extensive quantitative and qualitative analyses. Our findings reveal a preference for GPT-generated summaries over those from task-specific pre-trained models and reference summaries, highlighting GPT models' ability to follow human guidelines despite occasionally producing longer outputs and exhibiting divergent lexical and structural alignment with references. The discrepancy between ROUGE, BERTScore, and human evaluation underscores the need for more reliable automatic evaluation metrics.