🤖 AI Summary
This study addresses the risk of political bias in multi-document news summarization systems, which can lead to unfair representation of viewpoints. Using the FairNews dataset, which is annotated with political orientation labels, the authors evaluate thirteen large language models across five fairness metrics and systematically assess debiasing strategies, including prompt-based and judge-based approaches. The findings reveal no positive correlation between model scale and fairness, with mid-sized models achieving the best trade-off between fairness and efficiency. The efficacy of prompt-based debiasing is highly model-dependent, while fairness along the entity sentiment dimension proves the hardest to improve: it resisted every intervention tested. The work underscores the necessity of multidimensional evaluation frameworks and architecture-aware, targeted debiasing approaches.
📝 Abstract
Multi-document news summarisation systems are increasingly adopted for their convenience in processing vast daily news content, making fairness across diverse political perspectives critical. However, these systems can exhibit political bias through unequal representation of viewpoints, disproportionate emphasis on certain perspectives, and systematic underrepresentation of minority voices. This study presents a comprehensive evaluation of such bias in multi-document news summarisation using FairNews, a dataset of complete news articles with political orientation labels. We examine how large language models (LLMs) handle sources with varying political leanings, covering 13 models and five fairness metrics. We investigate both baseline model performance and the effectiveness of various debiasing interventions, including prompt-based and judge-based approaches. Our findings challenge the assumption that larger models yield fairer outputs, as mid-sized variants consistently outperform their larger counterparts, offering the best balance of fairness and efficiency. Prompt-based debiasing proves highly model-dependent, while entity sentiment emerges as the most stubborn fairness dimension, resisting all intervention strategies tested. These results demonstrate that fairness in multi-document news summarisation requires multi-dimensional evaluation frameworks and targeted, architecture-aware debiasing rather than simply scaling up.