🤖 AI Summary
This study investigates whether GPT-series models (ChatGPT, GPT-4, GPT-4o) can adhere to human-authored summarization guidelines for English social dialogues and French call-center conversations. Method: a guideline-aligned evaluation framework in which human assessment, conducted against the summarization guidelines themselves, serves as the primary measure, complemented by automatic metrics (ROUGE, BERTScore). The models are steered toward guideline compliance through varied prompting strategies on two existing datasets, DialogSum (English) and DECODA (French). Contribution/Results: GPT-generated summaries are preferred by human evaluators over those from task-specific pre-trained summarizers and even over reference summaries, indicating strong guideline-following ability, although outputs tend to be longer and diverge lexically and structurally from references. The observed misalignment between ROUGE, BERTScore, and human judgments underscores the need for more reliable automatic evaluation metrics in guideline-constrained summarization.
📝 Abstract
This study investigates the ability of GPT models (ChatGPT, GPT-4 and GPT-4o) to generate dialogue summaries that adhere to human guidelines. Our evaluation involved experimenting with various prompts to guide the models in complying with guidelines on two datasets: DialogSum (English social conversations) and DECODA (French call center interactions). Human evaluation, based on summarization guidelines, served as the primary assessment method, complemented by extensive quantitative and qualitative analyses. Our findings reveal a preference for GPT-generated summaries over those from task-specific pre-trained models and reference summaries, highlighting GPT models' ability to follow human guidelines despite occasionally producing longer outputs and exhibiting divergent lexical and structural alignment with references. The discrepancy between ROUGE, BERTScore, and human evaluation underscores the need for more reliable automatic evaluation metrics.