🤖 AI Summary
This work addresses the lack of social-attribute-level fairness evaluation in multi-document summarization (MDS). We propose the first coverage-driven, two-level fairness evaluation framework: *Equal Coverage* at the summary level and *Coverage Parity* at the corpus level. These measures explicitly model how document redundancy affects fairness and reveal systematic overrepresentation of specific social attributes by large language models (LLMs). Through coverage-based quantitative analysis, social attribute annotation, human judgment experiments, and benchmarking of 13 LLMs, we demonstrate strong alignment between our measures and human fairness assessments (Spearman's ρ > 0.89). Claude3-sonnet achieves the highest fairness, while most LLMs exhibit significant social attribute bias. This work establishes an interpretable, reproducible theoretical and empirical foundation for fairness evaluation in MDS.
📝 Abstract
Fairness in multi-document summarization (MDS) measures whether a system generates a summary that fairly represents information from documents with different social attribute values. Fairness in MDS is crucial because a fair summary offers readers a comprehensive view. Previous work quantifies summary-level fairness using Proportional Representation, a fairness measure based on Statistical Parity. However, Proportional Representation does not account for redundancy in the input documents and overlooks corpus-level unfairness. In this work, we propose a new summary-level fairness measure, Equal Coverage, which is based on coverage of documents with different social attribute values and accounts for redundancy within the documents. To detect corpus-level unfairness, we propose a new corpus-level measure, Coverage Parity. Our human evaluations show that our measures align more closely with our definition of fairness than Proportional Representation does. Using our measures, we evaluate the fairness of thirteen LLMs. We find that Claude3-sonnet is the fairest among all evaluated LLMs, and that almost all LLMs overrepresent certain social attribute values.
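The coverage-based idea behind a summary-level fairness measure can be sketched roughly as follows. This is a minimal illustration under our own assumptions, not the paper's actual formulation: the token-overlap notion of coverage, the group-mean comparison, and the `equal_coverage_gap` statistic are all hypothetical stand-ins, and the paper's handling of redundancy between documents is omitted here.

```python
from collections import defaultdict

def coverage(summary: str, doc: str) -> float:
    """Fraction of a document's unique tokens that appear in the summary
    (a crude token-overlap stand-in for a coverage score)."""
    doc_tokens = set(doc.split())
    if not doc_tokens:
        return 0.0
    return len(doc_tokens & set(summary.split())) / len(doc_tokens)

def equal_coverage_gap(summary: str, docs: list[str], attrs: list[str]) -> float:
    """Illustrative summary-level unfairness score: the largest difference
    in mean coverage between social attribute groups (0 = perfectly even).
    A faithful implementation would also down-weight redundant documents."""
    groups = defaultdict(list)
    for doc, attr in zip(docs, attrs):
        groups[attr].append(coverage(summary, doc))
    means = [sum(scores) / len(scores) for scores in groups.values()]
    return max(means) - min(means)
```

For example, a summary that covers every document from one attribute group but none from another would receive the maximal gap of 1.0, while a summary covering both groups equally would score 0.0.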